Search logs:

channel logs for 2004-2010 are archived at http://tunes.org/~nef/logs/old/ (these can't be searched)

#osdev2 = #osdev @ Libera from 23may2021 to present

#osdev @ OPN/FreeNode from 3apr2001 to 23may2021

all other channels are on OPN/FreeNode from 2004 to present


http://bespin.org/~qz/search/?view=1&c=osdev2&y=23&m=3&d=5

Sunday, 5 March 2023

00:00:00 <heat> most europeanest sport
00:00:00 <theWeaver> it's fucking cringe tho
00:00:00 <heat> why
00:00:00 <theWeaver> because of danish people i guess
00:00:00 <heat> you have a ball, dribble, jump, throw it towards the goal, hopefully score
00:00:00 <theWeaver> yeah but you let danes play and danes are cringe
00:00:00 <heat> although AFAIK most of them wear pants which is a certified cringe
00:01:00 <heat> sports should be played with shorts and not pants
00:01:00 <theWeaver> are we talking british english pants or american english pants
00:01:00 <heat> american
00:01:00 <theWeaver> idk why i asked, both options would be pretty cringe
00:01:00 <FireFly> theWeaver: well is like, cricket the only noncringe sport or what :p
00:02:00 <theWeaver> FireFly: no, as I said, cricket is fucking cringe as fuck
00:02:00 <theWeaver> re-read the scrollback pls
00:02:00 <FireFly> oops
00:02:00 <heat> chad sport: gymnastics
00:02:00 <heat> which one? all of it
00:02:00 <theWeaver> rn the only non-cringe sport i can think of is basketball
00:02:00 <theWeaver> basketball is just straight baller
00:02:00 <theWeaver> no question
00:03:00 <FireFly> pretty sure danish people do basketball too :P
00:03:00 <heat> nba basketball is cringe, ncaa basketball is cringier
00:03:00 <theWeaver> basketball is so cool even danes can't make it cringe
00:03:00 <heat> all american sports are just poor excuses for ad breaks
00:03:00 <theWeaver> heat: your mum is cringe
00:03:00 <FireFly> ok what about floorball? :p
00:03:00 <heat> no urs is cringe
00:04:00 <theWeaver> heat: yeah she's pretty fuckin cringe
00:04:00 <theWeaver> doesn't change the fact, yours is too
00:04:00 <heat> OH hockey is legit
00:04:00 <theWeaver> actually roller derby is non-cringe
00:04:00 <FireFly> oh yeah
00:04:00 <FireFly> roller derby's cool
00:04:00 <theWeaver> roller derby is mega cool
00:05:00 <theWeaver> it's the most lesbian sport ever and it rules
00:05:00 <heat> you know what's REALLY cringe?
00:05:00 <theWeaver> heat: what
00:05:00 <heat> snooker and cycling
00:05:00 <heat> old people sports
00:05:00 <theWeaver> idk i feel like golf is even worse
00:05:00 <gog> golf is the wrost
00:05:00 <gog> golf is evil
00:05:00 <theWeaver> but snooker and cycling are pretty cringe yeah
00:06:00 <heat> snooker is like the only sport where the top 5 is composed by old, balding british men
00:06:00 <heat> with a belly ofc
00:06:00 <FireFly> billiard is fine in a casual setting :p
00:06:00 <theWeaver> pool is cool
00:06:00 <theWeaver> snooker is cringe
00:06:00 <FireFly> oh idk the precise differences
00:07:00 <theWeaver> FireFly: if it helps i'll repeat it
00:07:00 <theWeaver> the cool one is pool
00:07:00 <theWeaver> the cringe one is snooker
00:07:00 <FireFly> :p
00:07:00 <heat> chess boxing is mega cringe
00:07:00 <heat> just nerds trying to be cool
00:07:00 <theWeaver> ... what the fuck is chess boxing
00:07:00 <FireFly> that's fine :p
00:07:00 <heat> it's exactly what it sounds like
00:08:00 <FireFly> theWeaver: alternating short rounds of boxing and chess until either checkmate or knockout
00:08:00 <theWeaver> i can't conceive of how you even mix those two
00:08:00 <theWeaver> FireFly: what
00:08:00 <theWeaver> are you serious
00:08:00 <FireFly> the idea being that you probably make poorer moves after having been beaten for a bit
00:08:00 <FireFly> yes :p
00:08:00 <FireFly> idk people are allowed to do silly things
00:08:00 <FireFly> I don't mind :p
00:09:00 <theWeaver> i'm not saying they're not allowed
00:09:00 <theWeaver> i just don't really get it
00:09:00 <theWeaver> but then people do all sorts of stupid shit they shouldn't be allowed to that i do get
00:09:00 <FireFly> like what?
00:10:00 <heat> sniffing glue
00:11:00 <theWeaver> voting for tories
00:11:00 <theWeaver> wait shit
00:11:00 <theWeaver> no not that one, i don't get that one
00:11:00 <heat> hahaha
00:11:00 <theWeaver> (definitely stupid and shouldn't be allowed tho)
00:11:00 <heat> labour's now-annual how-to-lose-an-election is definitely a sport
00:12:00 <FireFly> "The goal of eight-ball, which is played with a full rack of fifteen balls and the cue ball, is to claim a suit (commonly stripes or solids in the US, and reds or yellows in the UK), [...]" wait what, red/yellow instead of solid/striped o.o
00:12:00 <theWeaver> someone needs to punch Keith in the dick and make him resign
00:12:00 <theWeaver> useless motherfucker
00:12:00 <heat> who's keith?
00:12:00 <theWeaver> keir starmer
00:13:00 <FireFly> oh yeah rainy island politics is very weird
00:13:00 <heat> not-a-sport: chess, poker
00:13:00 <FireFly> from what I gather from the occaisonal updates I hear of it
00:14:00 <heat> also darts are cringe
00:14:00 <theWeaver> FireFly: yeah must be strange for someone like you who lives in a partially sane country and comes from a halfway decent one
00:16:00 <heat> britain is relatively sane
00:16:00 <theWeaver> relative to what
00:16:00 <theWeaver> the USA?
00:16:00 <heat> britain is only "omg totally insane lads can't take this" to british people
00:16:00 <theWeaver> because if so yeah kinda but that's a dangerously low bar
00:17:00 <heat> 3/4 of europe at least
00:17:00 <theWeaver> i've yet to find a european country that was more fucked up than the UK
00:17:00 <theWeaver> france maybe
00:19:00 <heat> france, italy, germany, portugal, spain, all of eastern europe
00:19:00 <theWeaver> fuck off, germany is definitely saner
00:19:00 <theWeaver> spain too
00:19:00 <heat> spain is not sane
00:19:00 <FireFly> not sure I think france is more fucked up than the UK tbh
00:19:00 <FireFly> heat: that was not the claim :p
00:20:00 <heat> spain has like 3 separatist movements going on at the same time
00:20:00 <mjg> lol @ your separatist movements
00:21:00 <mjg> in poland there are at least 3 separate monarchist movements
00:21:00 <mjg> one of them has a self-proclaimed regent
00:21:00 <mjg> apart from all of this there was a self-proclaimed king
00:21:00 <mjg> i don't even know who to bow to anymore
00:21:00 <heat> mjg for king
00:22:00 <mjg> i would teach limitations of big O in elementary school
00:22:00 <mjg> feel the tyranny
00:23:00 <heat> history is just unix geezers who wrote PESSIMAL code
00:23:00 <mjg> there are also non-unix geezers who did the same thing man
00:23:00 <heat> and readings of git blame
00:23:00 <theWeaver> tbh, politics be fucked up
00:23:00 <mjg> old unix geezer is today's webdev
00:23:00 <FireFly> "remember that asymptotic behaviour doesn't necessarily specialise to a specific choice of n, kids" "..can we learn about multiplication now, teacher?"
00:24:00 <mjg> FireFly: "mention balancing your checkbook again. i dare you, i double dare you motherfucker"
00:24:00 <theWeaver> mjg: what?
00:24:00 <heat> theWeaver, germany has insane politics. like late stage UK politics
00:24:00 <mjg> theWeaver: what what
00:24:00 <theWeaver> mjg: what what, in the butt?
00:24:00 <FireFly> germany has dumb politics but doesn't the UK have even more of that? :p from my POV at least
00:24:00 <mjg> theWeaver: no propositioning on a sunday
00:25:00 <FireFly> I mean hey rainy island even went full brexit
00:25:00 <mjg> FireFly: what's the german equivalent of farage?
00:25:00 <theWeaver> mjg: ooookaaaaaaaaaaaay
00:25:00 <heat> UK has a two party system with one party dominance and a bunch of smaller, separate parties
00:25:00 <heat> germany has CDU
00:26:00 <theWeaver> heat: CDU is still not as bad as the tories
00:26:00 <FireFly> in theory it's a SPD/CDU two-major-parties one-on-each-side system though, no?
00:26:00 <geist> okay... so
00:26:00 <FireFly> lessee
00:26:00 * geist points to the stay-on-topic sign
00:26:00 <FireFly> ..reasonable yes
00:26:00 <theWeaver> there's a topic in this channel?
00:26:00 * theWeaver just uses #osdev to shitpost in
00:26:00 <geist> we really should make a #osdev-offtopic channel
00:26:00 <geist> yeah please dont
00:27:00 <mjg> so fun fact: a naive sse2 memcpy *loses* to naive movs for sizes up to ~24
00:27:00 <mjg> the one found in bionic
00:27:00 <heat> i mean wasn't #offtopia #osdev-offtopic?
00:27:00 <mjg> [note: no sse used below 16]
00:27:00 <geist> heat: i didn't parse that sentence
00:27:00 <heat> i think current #offtopia was #osdev-offtopic a few years ago
00:28:00 <geist> well, may not have survived the move to libera
00:28:00 <mjg> what on earth is #Offtopia
00:28:00 <geist> yah i dunno what that is
00:28:00 <heat> it's an offtopic channel with a bunch of #osdev people in it
00:28:00 <mjg> what's the signal:noise ratio on that one
00:29:00 <heat> anyway re: memcpy, bionic sse2 string ops aren't that great
00:29:00 <mjg> i can tell you a well guarded secret: freebsd developer channel is named #sparc64, which is extra funny now that the arch is no longer supported
00:29:00 <mjg> heat: agreed
00:29:00 <geist> surprised it tried to use sse on anything smaller than say 64 bytes or so
00:30:00 <heat> blind rep movsb can beat its fancy sse2 memcpy
00:30:00 <heat> in fact, it mostly does
00:30:00 <geist> mjg: seems like a pretty good way to avoid lookyloos
00:30:00 <mjg> i'm pretty sure glibc uses simd as soon as it can
00:30:00 <geist> but anyway bionic not having an optimized x86 is probably not that surprising, considering android on x86 is not that big of a thing
00:30:00 <mjg> so either 16 or 32 depending on instruction set
00:30:00 <heat> geist, it does, contributed by intel
00:30:00 <geist> exactly.
00:31:00 <mjg> quite frankly i would expect that code to be slapped in from whatever internals they had
00:31:00 <geist> and then probably promptly dropped on the floor
00:31:00 <mjg> probably shared with glibc to some extent at the time
00:31:00 <heat> glibc's string ops code is nuts
00:31:00 <mjg> so it's not like it was coded up by an intern
00:31:00 <heat> the way they have it, avx512 code is just avx code which is just sse code but all with different params
00:32:00 <mjg> these ops have hardcoded parameters for one uarch
00:32:00 <mjg> i'm guessing the asm over there is what glibc would have used at the time for said arch
00:32:00 <mjg> extracted from entire machinery
00:33:00 <geist> so here's a question: microarchitecturally speaking is it *always* a good idea to have the bestest fastest possible memcpy
00:33:00 <geist> ie, given that you have a cpu that has potentially a lot of things in flight, or prefetching this and that
00:33:00 <mjg> but what makes the besterest mate
00:33:00 <geist> does shoving through the maximum amount of data through the memory subsystem per clock negatively affect other things that may be going on at the same time
00:33:00 <heat> mjg, btw i have experimentally verified that the avx2 memset is really solid
00:33:00 <geist> or even competing against other hyperthreads
00:33:00 <geist> well, i'd say bestest as in 'maximum number of bytes/clock'
00:33:00 <heat> it doubles the bandwidth of sse2 memset
00:34:00 <mjg> geist: that's a funny one
00:34:00 <heat> so if you had avx memcpy you could potentially also do double
00:34:00 <geist> i'm sure the answer is probably yes, but there are i'm sure sometimes downsides, kinda like how you mentioned the clzero on AMD can saturate one socket
00:34:00 <heat> and that's more or less what glibc does too
00:35:00 <mjg> geist: short answer is 'who the fuck knows', in practice people go for the fastest possible
00:35:00 <mjg> i will note all benchez i had seen do stuff in isolation
00:35:00 <geist> yah i mean in lieu of anything else, fastest > not fastest
00:35:00 <geist> hypothetically a fast memcpy competes negatively with SMT pairs that are off running 'regular' code at the same time, for example
00:35:00 <mjg> i also have to note that glibc has tons of uarch-specific quirks
00:36:00 <mjg> in its string ops
00:36:00 <mjg> thus i found it plausible they damage control concerns like the above
00:36:00 * geist nods
00:36:00 <mjg> to the point where you end up with a net win
00:37:00 <mjg> that said, personally i don't have good enough basis to make a definitive statement on the matter
00:37:00 <geist> yah was more of a thought experiment than anything else
00:37:00 <geist> one for which there isn't a solid answer probably
00:37:00 <mjg> i'll note one common idea is to roll with nt stores
00:37:00 <geist> or there is an answer, microarchitecturally, in very specific situations but in aggregate it is a win
00:37:00 <mjg> past certain size
00:37:00 <mjg> which is already a massive 'who knows'
00:38:00 <geist> yah that probably helps. i assume for example that NT stores dont chew up lots of write queue units
00:38:00 <mjg> the folks at G had the right idea with their automemcpy paper
00:38:00 <mjg> instead of hoping for generic 'least bad everywhere', they created best suited for the workload
00:38:00 <geist> if the cpu can track say 16 outstanding writes, and some memcpy comes along and fills at 16, then you probably have to wait for all the previous ones to finish
00:38:00 <geist> or maybe NT stores only use one at a time
00:38:00 <geist> while the other writebacks can finish
00:39:00 <mjg> well really depends when you start doing them
00:39:00 <geist> and similarly if the memcpy is slamming the load units, then the cpu may not internally compete well with it for preemptive reads
00:39:00 <mjg> past some arbitrary threshold or perhaps when you know the total wont fit the cache
00:40:00 <mjg> ultimately ram bandwidth is not infinite either
00:40:00 <mjg> as usual the win is to not do memcpy if it can be helped :p
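A minimal sketch of the NT-store idea discussed above, using SSE2 intrinsics. NT_CUTOFF and memcpy_nt are illustrative names, and the real cutoff is uarch-specific guesswork, as mjg says; this just shows the shape (regular stores for small/misaligned copies, streaming stores plus a fence past the threshold):

    #include <emmintrin.h>   /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
    #include <stdint.h>
    #include <string.h>

    #define NT_CUTOFF (1u << 20)   /* hypothetical "won't fit the cache" threshold */

    void *memcpy_nt(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;

        /* small or misaligned-destination copies: plain stores */
        if (n < NT_CUTOFF || ((uintptr_t)d & 15) != 0)
            return memcpy(dst, src, n);

        size_t i, chunks = n / 16;
        for (i = 0; i < chunks; i++) {
            __m128i v = _mm_loadu_si128((const __m128i *)(s + i * 16));
            _mm_stream_si128((__m128i *)(d + i * 16), v);  /* bypass the cache */
        }
        _mm_sfence();   /* order the NT stores before any later normal stores */

        memcpy(d + chunks * 16, s + chunks * 16, n % 16);  /* tail */
        return dst;
    }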
00:41:00 <geist> re: the discussion of SMT static vs dynamic assignment of resources, iirc the ryzen at least has at least some amount of static assignment in the load/store units, i think
00:41:00 <geist> so maybe they avoided the problem by chopping it in half there
00:41:00 <geist> not so much the load/store units, but the amount of outstanding transactions i think
00:42:00 <mjg> so i may be able to get a sensible data point soon(tm). as noted few times freebsd libc string ops don't use simd, but i can plop in some primitives and see what happens
00:42:00 <heat> ohhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh the llvm-libc string ops are theirs (automemcpy)
00:43:00 <mjg> btw facebook has simd memcpy and memset apache licensed
00:43:00 <mjg> https://github.com/facebook/folly/blob/main/folly/memcpy.S
00:43:00 <bslsk05> ​github.com: folly/memcpy.S at main · facebook/folly · GitHub
00:44:00 <mjg> interestingly this bit:
00:44:00 <mjg> .L_GE2_LE3:
00:44:00 <mjg>     movw (%rsi), %r8w
00:44:00 <mjg>     movw -2(%rsi,%rdx), %r9w
00:44:00 <mjg>     movw %r8w, (%rdi)
00:44:00 <mjg>     movw %r9w, -2(%rdi,%rdx)
00:44:00 <heat> that's sick
00:44:00 <mjg> this bit used to suffer some uarch bullshit penalty
00:44:00 <mjg> i don't know about today
00:44:00 <moon-child> 'in some implementations they're the same function and legacy programs rely on this behavior' sigh
00:45:00 <moon-child> penalty for what?
00:45:00 <geist> hah i love how that folly project doesn't even attempt to architecturally isolate
00:45:00 <mjg> you had to movzwl to dodge it
00:45:00 <geist> guess facebook dont do arm
00:45:00 <mjg> geist: given their comments looks like they only do skylake :p
00:45:00 <geist> yah
00:45:00 <mjg> moon-child: partial register reads
00:46:00 <mjg> moon-child: this guy: movw (%rsi), %r8w and this guy: movw -2(%rsi,%rdx), %r9w
00:46:00 <mjg> moon-child: erm, stores to
00:46:00 <moon-child> oh right yes
00:46:00 <mjg> i would not be surprised if this was still a problem
00:46:00 <mjg> there is so much bullshit to know of it is quite frankly discouraging to try anything
00:47:00 <geist> yah really helps to only worry about one microarchitecture (skylake-x)
00:47:00 <mjg> guaranteed fuckUarch suffers a penalty in a stupid corner case you can't hope to even test
00:47:00 <geist> that's the luxury the big companies can generally rely on
00:47:00 <geist> then everyone scrambles when some new uarch comes along, but that's job security
00:47:00 <moon-child> I think at one point it was able to rename the low bits differently, but only when it knows the high bits are zeroed? But then maybe at some point they walked back on that as not worth the effort, since no one actually uses small registers?
00:47:00 <moon-child> don't remember
00:47:00 <mjg> there was some bullshit how if buffers differ in a multiple of page size there is a massive penalty
00:48:00 <mjg> copying forward
00:48:00 <mjg> like off cpu
00:48:00 <mjg> buffer addresses
00:48:00 <moon-child> oh yess cache associativity stuff
00:48:00 <moon-child> that's a thing pretty much everywhere
00:48:00 <moon-child> apparently it's too expensive to put even a really dumb hash in front of l1
00:48:00 <mjg> and ERMS *backwards* is turbo slow
00:48:00 <heat> rep movsb backwards is not ERMS
00:48:00 <heat> it's explicitly stated
00:49:00 <mjg> ye ye
00:49:00 <mjg> point is
00:49:00 <mjg> another lol corner case
00:51:00 <heat> mjg, what's an overlapping copy
00:51:00 <heat> in memcpy implementation terms, overlapping stores or whatever
00:51:00 <heat> i don't get it
00:51:00 <mjg> copying a buffer partially onto itself
00:51:00 <moon-child> suppose you wanna copy 7 bytes
00:51:00 <moon-child> you copy the first 4 bytes, and the last 4 bytes
00:51:00 <moon-child> these have a one byte overlap in the middle, but it doesn't matter, because it's the same byte
00:52:00 <heat> 1) how 2) why is this faster
00:52:00 <moon-child> this lets you handle 4, 5, 6, 7, 8 bytes in one path with no branches
00:52:00 <mjg> right
00:52:00 <mjg> the overlapping stores suffer a penalty to some extent
00:52:00 <mjg> but it is cheaper than branching on exact size
00:53:00 <moon-child> char *dst, *src; size_t n; if (4 <= n && n <= 8) { int lo4 = *(int*)src, hi4 = *(int*)(src+n-4); *(int*)dst = lo4; *(int*)(dst+n-4) = hi4; }
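moon-child's snippet, written out as a compilable sketch (copy_4_to_8 is an illustrative name; memcpy on the 4-byte lanes keeps it strict-aliasing clean and compiles down to plain loads and stores):

    #include <stdint.h>
    #include <string.h>

    /* one branchless path covers n = 4..8: the two 4-byte stores may
     * overlap in the middle, but they write the same bytes there */
    static void copy_4_to_8(void *dst, const void *src, size_t n)
    {
        const unsigned char *s = src;
        unsigned char *d = dst;
        uint32_t lo, hi;

        memcpy(&lo, s, 4);          /* first 4 bytes */
        memcpy(&hi, s + n - 4, 4);  /* last 4 bytes, may overlap lo */
        memcpy(d, &lo, 4);
        memcpy(d + n - 4, &hi, 4);
    }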
00:53:00 <mjg> the .L_GE2_LE3: has a simple example
00:55:00 <mjg> what breaks my heart is that agner fog recommends a routine which does *NOT* do it
00:55:00 <mjg> 's like wtf man
00:55:00 <mjg> that was my first attempt and it sux0red
00:56:00 <moon-child> don't meet your heroes
00:56:00 <mjg> :]
00:56:00 <mjg> you are too young to truly know that mate
00:58:00 <heat> ok I understand the simple overlapping stores thing
00:59:00 <heat> how do I use that to write a fast GPRs-only memcpy
00:59:00 <mjg> i'm afraid see bionic memset for sizes up to 16
01:00:00 <mjg> that covers that range
01:00:00 <moon-child> https://cgit.freebsd.org/src/tree/lib/libc/amd64/string/memmove.S or so? :)
01:00:00 <bslsk05> ​cgit.freebsd.org: memmove.S « string « amd64 « libc « lib - src - FreeBSD source tree
01:00:00 <mjg> moon-child: funny you mention it, i'm gonna do some fixups over there :]
01:00:00 <mjg> for example rcx is written to for hysterical raisins
01:01:00 <heat> what's wrong with your label names
01:01:00 <mjg> what happened was the original code was rep movs only, that uses rcx for the count
01:01:00 <mjg> heat: sad necessity for macro usage
01:01:00 <mjg> heat: i don't like it but did not want to waste more life arguing
01:02:00 <heat> it's like you obfuscated the code lol
01:02:00 <mjg> the end result is that this is plopped into copyin/copyout
01:02:00 <mjg> as in the same code shared
01:03:00 <mjg> if there is a nice way to do it i would be happy to change it
01:03:00 <heat> namespace your label names?
01:03:00 <mjg> note it does not inject jumps or whatever to do the actual copy, it's all *in* routines
01:03:00 <heat> .Lmemmove_0_to_8:
01:03:00 <mjg> heat: again macroz, names just don't work, but maybe there is a work around for it
01:03:00 <mjg> mate that's what i started with
01:03:00 <mjg> :]
01:03:00 <heat> why don't they work?
01:04:00 <mjg> i don't remember the error, but it craps out
01:04:00 <mjg> it's been like 5 years since i wrote it
01:04:00 <moon-child> maybe can do token pasting or something?
01:04:00 <moon-child> it generates two versions one with erms and one without, so duplicate label names right?
01:04:00 <mjg> again i don't remember what happened
01:05:00 <mjg> here is what does matter right now:
01:05:00 <mjg> 1. rep sucks
01:05:00 <mjg> 2. there is no speed up from making a loop bigger than 32 bytes per iteration
01:05:00 <mjg> 3. it's all sad tradeoffs
01:06:00 <mjg> 4. did i mention rep sucks?
01:06:00 <heat> doesn't rep unsuck on large sizes?
01:06:00 <mjg> it most definitely does not
01:06:00 <mjg> except vast majority of all calls are for sizes for which it *does* suck
01:06:00 <heat> also, what happens if you pad with int3's instead of nops?
01:07:00 <mjg> https://people.freebsd.org/~mjg/bufsizes.txt
01:07:00 <mjg> afair some uarchs don't like that
01:07:00 <mjg> as in they get a penalty
01:08:00 <heat> i know ret; int3 does, as it's used to stop SLS
01:08:00 <mjg> look mate, it is over 2 am here, i'm heading off
01:09:00 <mjg> that memmove.S has some tradeoffs which maybe are not completely defensible, and defo has some crap which i'm gonna fix
01:09:00 <mjg> i would say look at bionic
01:09:00 <mjg> :]
01:09:00 <mjg> for < 16 bytes
01:09:00 <mjg> hue movb(%rsi),%dl
01:09:00 <mjg> partial fucker not taken care of
01:09:00 <geist> https://gcc.godbolt.org/z/4q58c1sj7 pretty strange looking at the codegen in tehre, gcc does some weird stuff right in the middle of that 8 byte loop
01:09:00 <mjg> gonna movzbl it
01:10:00 <geist> ie, at .L5
01:10:00 <geist> it seems to add 8 to the in var (a5) but it recomputes the dest var (a3) by adding the in + some precomputed delta between the two
01:10:00 <geist> very strange, the more logical code is to just add 8 to a3
01:11:00 <geist> but it does do the logical thing and compute the max address and do a bne against that, instead of subtracting one from a counter, like the code is written
01:11:00 <mjg> heat: maybe i expressed myself poorly. rep is great for "big enough" sizes, but said sizes are rarely used in comparison to sizes for which it is terrible
01:11:00 <geist> since riscv has nice compare two regs and branch instructions
01:12:00 <heat> yes
01:12:00 <mjg> heat: i'm off
01:12:00 <heat> bye
01:12:00 <geist> clang compiles this code as written, no weird optimizations.
01:12:00 <geist> even puts the store right after the load :(
01:12:00 <heat> i'll either write an ebpf thing or write a memcpy tonight, maybe both
01:12:00 <heat> ideally none
01:19:00 <geist> yeeeesssss
01:53:00 <geist> been piddling with it, and FWIW glibc currently has no specially optimized string routines for riscv
01:54:00 <geist> but the default C version does a fairly good job
01:55:00 <heat> yes, it doesn't
01:56:00 <heat> also if you looked at the source you'll see that the generic memcpy has page moving capabilities because of hurd
01:56:00 <heat> :v
02:01:00 <heat> haha, fun fact: GAS .align N, 0x90 actually picks smart nop sizes on x86
02:23:00 <heat> ok memcpy done
02:23:00 <heat> not hard
02:23:00 <heat> mjg will probably shoot it full of holes tomorrow but i'm relatively satisfied
02:43:00 <geist> yeah, i'm doing kinda the same thing
02:44:00 <geist> have a reasonably tight 8 byte at a time riscv memset working
02:44:00 <geist> not really any better than the compiler could do with similar C code, but it's nicer to read and commented
02:45:00 <geist> drat, still gets trounced by glibcs version which unrolls it to 64 byte
02:47:00 <heat> i should check if 64 byte makes a difference here
03:05:00 <heat> no, it doesnt
03:30:00 <moon-child> y'all have riscv hardware?
03:30:00 <geist> i do, now
03:33:00 <geist> woot. my new asm memset now matches or beats glibc
03:34:00 <heat> better share it comrade
03:34:00 <heat> give us our new memcpy
03:34:00 <geist> yah hang on a sec
03:34:00 <geist> memcpy is next, but probably wont get to that tonight
03:34:00 <geist> memset is to just warm up, get used to writing riscv asm in large amounts. there's kinda a zen to it
03:35:00 <heat> yeah im not really comfortable doing it
03:35:00 <heat> for any risc really
03:38:00 <geist> https://gist.github.com/travisg/7c6b5494990162241a8f590fb2cb06ba
03:38:00 <bslsk05> ​gist.github.com: riscv memset · GitHub
03:38:00 <geist> may still be bugs, but it passes my test harness
03:39:00 <heat> // TODO: see if multiplying by 0x1010101010101010 is faster
03:39:00 <heat> is it?
03:39:00 <geist> dunno!
03:39:00 <geist> the hard part is getting the constant into the register, which requires a load
03:40:00 <geist> but when i ran the expand logic into godbolt gcc just does the mul
03:40:00 <geist> https://gcc.godbolt.org/z/r3Wa6ov4T
03:40:00 <heat> yeah
03:41:00 <geist> clang does some even weirder shit: https://gcc.godbolt.org/z/o46vsv59W
03:41:00 <heat> haha that's genius
03:41:00 <geist> i think it's actually doing the shift and add trick to synthesize the constant, and then mul it
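The trick both compilers settle on, spelled out in C: multiplying by 0x0101010101010101 (every byte set to 1 — note this differs from the constant quoted in the TODO) replicates the fill byte into all eight lanes. splat8 is an illustrative name:

    #include <stdint.h>

    /* e.g. 0x2a * 0x0101010101010101 = 0x2a2a2a2a2a2a2a2a */
    static inline uint64_t splat8(unsigned char c)
    {
        return (uint64_t)c * 0x0101010101010101ULL;
    }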
03:44:00 <geist> there are a few other tricks i've seen the compiler do to rewrite 'store + add base reg + bne' to 'add base reg + store - 8 + bne'
03:44:00 <geist> though that's kinda debatable, because there still is a dep between incrementing the base reg and something
03:46:00 <heat> does any of that matter on your riscv cpu?
03:46:00 <geist> which part?
03:46:00 <heat> dependencies
03:47:00 <heat> can it do any ooo?
03:47:00 <geist> probably not, though the u74 is at least dual issue so i think there are some deps
03:51:00 <geist> but there's definitely a huge win to unrolling the inner loop on this thing. to the tune of 10GB/sec vs about 3.5
03:53:00 <geist> though that's only when in the L2 cache range (<2MB). once you get past that it settles in to what is apparently bus speed, which seems to be around 800MB/sec
03:55:00 <heat> that's pretty fast
04:02:00 <geist> yeah this is an actually kinda reasonable cpu. it seems to more or less perform as i expect for this class of core.
05:19:00 <sham1> mrvn: RE: commenting on adding a tagged integer. I'd expect there to be a macro or something for that. That is, to make a C integer into an OCaml one
07:40:00 <zzo38> :Here I wrote some ideas I have about operating system design: http://zzo38computer.org/fossil/osdesign.ui/dir?ci=trunk&name=design (Now you can complain about it, and/or other comment)
07:40:00 <bslsk05> ​zzo38computer.org: Unnamed Fossil Project: File List
08:26:00 <kof123> "does shoving through the maximum amount of data through the memory subsystem per clock negatively affect other things" anyone can make a faster cpu, the trick is to make a fast system -- cow tools guy ^H^H^H^H^H^H^H cray
08:28:00 <AmyMalik> the trick to making a fast CPU useful is to keep the fast CPU fed
08:28:00 <AmyMalik> bandwidth, latency, and actually having tasks you need done
08:43:00 <zid> hence cpu caches, hence speculative execution, hence, hence
08:48:00 <sham1> Hence Spectre
08:48:00 <sham1> And hence a nice James Bond movie
08:48:00 <zid> yes, the pinnacle of cpu design, spectre
08:48:00 <zid> The true end goal
08:49:00 <GeDaMo> I'm just reading about AMD's 3D cache https://www.tomshardware.com/news/amd-unveils-more-ryzen-3d-packaging-and-v-cache-details-at-hot-chips
08:49:00 <bslsk05> ​www.tomshardware.com: AMD Unveils More Ryzen 3D Packaging and V-Cache Details at Hot Chips | Tom's Hardware
08:49:00 <sham1> Well, you go so fast as to break security. At what point can we start saying that CPUs are fast enough
08:50:00 <sham1> We need both horizontal and vertical scaling
08:50:00 <zid> GeDaMo: You haven't bought one yet?
08:50:00 <GeDaMo> You know I haven't :|
08:50:00 <zid> Annoyingly AMD have done that thing where the most useful config is the cheapest model, so gets the worst silicon
08:50:00 <zid> my friend just did
08:51:00 <zid> so we've been playing with it
08:52:00 <GeDaMo> Is fast? :P
08:53:00 <zid> yep, tis fast
09:39:00 <netbsduser> zzo38: it sounds very mainframey
09:41:00 <netbsduser> the record-based files especially, and echoes of IBM i in the object stuff
09:41:00 <gog> hi
09:42:00 <netbsduser> gog: well come
09:43:00 <lav> hii
09:44:00 * gog patpatpatpat lav
09:44:00 <lav> ee
09:44:00 * lav purrs
09:48:00 <zid> I'm swearing off unix for being too woke, I did ls / and what do I see? Libs.
09:53:00 <lav> It's a little-known fact that Qt actually stands for Queer & transgender
09:53:00 <zid> and kde is.. kaleidoscope of dicks everywhere?
09:54:00 <lav> yup
09:57:00 <gog> i'm a qt qt
09:57:00 <gog> fr fr
09:57:00 <lav> agreed
09:58:00 <Ermine> hi gog, may I pet you?
09:59:00 <gog> yes
09:59:00 * Ermine pets gog
10:00:00 * gog prr
10:03:00 <zid> gog: how sure are we that you're not just a sussy cis sissy?
10:05:00 <gog> you don't need to be sure of anything breh
12:27:00 <netbsduser> fuse seems to be completely antithetical to a sane VFS
12:28:00 <netbsduser> there seems to be no separation of the file from the vnode layer, so you end up with the most outrageous requirements, like requiring to pass FUSE_RELEASE the same flags that you FUSE_OPEN'd something with
12:30:00 <netbsduser> i will just add two opaque fields to kernel file descriptors + pass the kernel fd to vnode ops purely for the sake of this monstrosity, since (fuse being undocumented) i don't dare try to figure out how to route around the nuttiness
12:36:00 <netbsduser> another bit of stupidity, the root inode number is explicitly specified to be FUSE_ROOT_NODE, but (at least with virtiofs) its `.` and `..` entries are not! this would play havoc with the vnode cache
15:22:00 <netbsduser> virtiofs is half-baked
15:22:00 <mrvn> so you can write files but not read them to check if it actually worked?
15:24:00 <netbsduser> i've got a root dir with a subdir "test" in it. result of FUSE_LOOKUP on the root dir for `test` = fuse node number 3. result of FUSE_LOOKUP on that folder for `.` is fuse node number 13968. oh, and while the root dir's fuse-recognised number is actually `1`, FUSE_LOOKUP `..` in 'test' = 13955
15:30:00 <mrvn> is that maybe the inode of the mountpoint?
15:31:00 <mrvn> and what does stat on / say?
15:32:00 <mrvn> Does anyone actually use "." and ".." from the FS and not generate them internally?
15:32:00 <netbsduser> those are indeed the inode numbers of the underlying mountpoint, virtiofs lets them appear, so it seems you need to treat fuse inode numbers and the inode numbers from getattr or lookup of `.` or `..` as fundamentally different
15:32:00 <mrvn> sounds like a bug in virtiofs though.
15:33:00 <netbsduser> mrvn: i used `.` in a failed attempt to reduce the effort to map fuse/virtiofs semantics to my vfs
15:34:00 <mrvn> If virtiofs doesn't map . and .. properly then I would rather have it not contain them in readdir at all.
15:34:00 <netbsduser> i can abolish my use of `.` but i really do need `..` though, and so i think most would, since it's not as though i carry a `parent directory` pointer in vnodes
15:35:00 <netbsduser> maybe linux does, they are known to be "different" in this area
15:35:00 <mrvn> netbsduser: without parent directory pointer how will you implement "mount --bind"?
15:35:00 <mrvn> or get from the root of a mounted filesystem (e.g. /home) to the parent directory?
15:37:00 <netbsduser> mrvn: however the BSDs do nullfs, and by checking `vnodecovered` field of the vnode and then doing lookup `..` on that vnode, respectively
15:38:00 <mrvn> you can skip the parent pointer if you add a generated ".." entry into the vnode of every dir. But that's just another way to store the parent pointer.
15:40:00 <netbsduser> i only store details like that in the name lookup cache (well, i would if i *had* a name lookup cache, i am speaking aspirationally)
15:40:00 <netbsduser> i fear virtio-fs might be completely incompatible with anything other than the Linux VFS if i can't figure out some workaround that at least lets me get `..` lookup to return the right thing
15:40:00 <mrvn> so maybe implementing the name lookup cache will fix the problem.
15:43:00 <netbsduser> it could get me *somewhere*, but i would have to support pinning entries in the cache. the problem remains of people `mv`'ing a directory on the host elsewhere; then the stale entry is stuck, and i can never get the right entry because FUSE_LOOKUP `..` will give me the unusable host fs inode number instead of the fuse inode number
15:44:00 <mrvn> sure. But you are already stuck with bind mounts and crossing filesystem barriers in general. ".." simply isn't unique.
15:45:00 <mrvn> Now if you don't want bind mounts then you still need to pin normal mountpoints so you can fix the ".." at the filesystem boundary.
15:45:00 <mrvn> Note: there is also mount --move in linux.
15:47:00 <mjg> heat: don't be so easy on yourself
15:48:00 <mrvn> netbsduser: what happens with virtiofs when something moves a directory on the host?
15:49:00 <mrvn> Do you end up with a "stale handle" error like with NFS?
15:49:00 <netbsduser> i am not convinced that it is necessary for bind mounts, at least not if they are equivalent to bsd nullfs (which mounts of subtree of the system fs into another point)
15:50:00 <netbsduser> and for the case of parent directory of mountpoints, that's handled especially by the vfs lookup function checking for a `vnodecovered` pointer in a vnode (meaning it's the root of an FS and occluding a vnode of another FS; then you lookup `..` of the vnodecovered to get the true parent)
15:50:00 <mrvn> netbsduser: you can bind mount /my/little/subdir /bla. Then ".." of /my/little/subdir is /my/little under /my/little/subdir and / under /bla. But both would be the same vnode, right?
15:50:00 <mrvn> Maybe vnodecovered also covers the bind mount case.
15:51:00 <sham1> This is why bind mounting is a bad idea
15:51:00 <sham1> You need to keep track of the full path you used to get to a place in order to properly do ..
15:51:00 <mrvn> I have no idea what code you are copying. I just know you need the .. stored explicitly for mountpoints somewhere.
15:51:00 <netbsduser> mrvn: on moving a directory on the host, i have no idea, it probably segfaults judging by my current experience with virtiofsd (it actually crashes every time qemu exits)
15:51:00 <sham1> Same with symlinks, which is why plan9 removed them
15:52:00 <mrvn> sham1: indeed. You need the full name lookup cache keeping the full path alive so you can follow the parent pointers.
15:53:00 <mrvn> And multiple cached names can point to the same inode.
15:53:00 <mrvn> .oO(But you have that with hardlinks already)
15:54:00 <mrvn> Another special case to keep in mind is chroot, or containers with a new FS namespace.
15:55:00 <sham1> Mm. I know that at least Serenity solves this by having essentially the file description remember the path it was accessed through, which is then cached so often used components are shared, and things like openat can then use these to do relative actions
15:55:00 <netbsduser> it appears that nullfs on netbsd at least creates new vnodes on-demand
15:56:00 <netbsduser> now dragonflybsd i remember boasts they need no such thing
15:58:00 <mrvn> is that something to boast about? It's not like it matters. Allocating a few vnodes is peanuts.
15:59:00 <sham1> We should just associate UUIDs with files
15:59:00 <sham1> Or GUIDs
15:59:00 <mrvn> too short.
16:00:00 <sham1> 128 bits is too short?
16:01:00 <sham1> Okay, then it can just be doubled. 256 bits
16:01:00 <mrvn> sham1: every directory you bind mount creates 2 paths to the file. So you need 1 bit to differentiate them. Do that 128 times and you have no bits left to specify the file in the dir at all.
16:02:00 <sham1> Ah
16:02:00 <sham1> So how many bits would a vnode need then
16:02:00 <mrvn> variable. it needs a parent pointer.
16:03:00 <mrvn> or you need a lookup from vnode ID to path
16:04:00 <mrvn> NFS runs into this problem because it's stateless. The client can't just throw some ID on the server because the server might not have the ID cached anymore and can't find the path. NFS handles do some magic to encode the path in some way but even that doesn't always work.
16:09:00 <netbsduser> my plan for now: fusefs_nodes will store their parent's node-id and that will be used to service `..` lookups
16:11:00 <netbsduser> all this stuff falls apart in the presence of the host moving the directory, but it sounds as though it falls apart rather nastily on linux too, so such is life
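A hypothetical sketch of that plan as a struct; the field names are made up for illustration, not taken from any real fusefs:

    #include <stdint.h>

    /* remember the parent's FUSE node id at lookup time, so `..`
     * lookups never have to trust host-side inode numbers */
    struct fusefs_node {
        uint64_t nodeid;         /* FUSE node id returned by FUSE_LOOKUP */
        uint64_t parent_nodeid;  /* node id of the directory we were found in */
        /* ... vnode back-pointer, cached attributes, etc. */
    };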
16:13:00 <mrvn> that's how all the union FSes in the kernel fall apart too, except a segfault in the kernel is worse. Only unionfs-fuse handles FS changes on the underlying FSes properly.
16:13:00 <mrvn> (well, without crashing, nothing you can do to fix it)
16:17:00 <netbsduser> i wonder on a related thing, how fuse deals with nfs, and this virtiofs fuse setup in particular
16:23:00 <mrvn> netbsduser: the nfs client has handles attached to each dentry and if something on the server changes you get an error about stale handles
16:24:00 <mrvn> fuse does nothing for NFS, doesn't even see NFS, only the vfs
16:25:00 <mrvn> remember: fuse filesystems are just user processes that access the filesystem though normal syscalls.
16:32:00 <zid> moon-child: it was you messing around with pointer tagging doubles and pointers together right?
16:48:00 <mrvn> What are pointer tagging doubles?
16:59:00 <mjg> heat: i'm gonna rip myself a new one for that memcpy i wrote :S
17:00:00 <mjg> heat: looks like we are both going to get chewed out on this one
17:01:00 <heat> mjg, does that memcpy suck?
17:02:00 <heat> i took it as an inspiration for mine
17:02:00 <mjg> suck -- no, but it has stupid perf bugs
17:02:00 <mjg> for example i did not take care of partial regs for 1 byte copy forward
17:02:00 <mjg> and for copies <= 4 bytes backward
17:03:00 <mrvn> mjg: memcpy does not handle overlapping
17:03:00 <mjg> mrvn: age old
17:03:00 <heat> ok question: 1) should you interleave loads and stores? 2) isn't this chain of branching a bad idea?
17:03:00 <mjg> look i need this for *memmove* as well, and there is overlap of code
17:03:00 <mrvn> true
17:04:00 <mjg> so my *memmove* has the above problem
17:04:00 <mrvn> heat: isn't that obsoleted by the cpu pipeline and out-of-order execution?
17:04:00 <sham1> So you'd use memmove for that overlapping code, obviously
17:04:00 <mrvn> (1)
17:04:00 <mjg> heat: normally you do all the loads first and stores later, then branch on whatever
17:04:00 <heat> like the 16 byte branch does e.g cmp $16, %len; jb .L8byte
17:04:00 <mjg> just show me your code
17:04:00 <heat> yeah
17:05:00 <heat> btw that lea trick is pretty cute
17:05:00 <mjg> i also note reg allocation is a little questionable, but i had hysterical reasons
17:05:00 <mjg> you would learn lea trick if you checked disasm of any real code mate
17:05:00 <mjg> 's how i did it :p
17:05:00 <mjg> i also *suspect* the code which aligns to 16 bytes could be much better
17:06:00 <mjg> i'll try to express it in C and see what clang comes up with
17:06:00 <heat> https://gist.github.com/heatd/fe2c9a2d3a4ef04616d481ee6660c722
17:06:00 <bslsk05> ​gist.github.com: memcpy-gpr.S · GitHub
17:07:00 <mrvn> heat: you aren't checking for alignment.
17:07:00 <mjg> movb (%rsi), %cl
17:07:00 <mjg> movzbl
17:07:00 <mjg> that's one of the bugs
17:08:00 <mjg> movw (%rsi), %cx
17:08:00 <mjg> movzwl
17:08:00 <heat> aha riiight
17:08:00 <heat> let me guess, some uarchs have false dependencies on the rest of rcx?
17:09:00 <mjg> i don't remember what happens there, i do remember i did measure a slowdown from not doing it
17:09:00 <mjg> on haswell et al
17:09:00 <mjg> it may be kabylake no longer has the problem
17:09:00 <mjg> .L1_byte: missing ALIGN_TEXT?
17:09:00 <mjg> .Lerms: mov %rdx, %rcx rep movsb -- normally you want to align the buf at least to 16
17:09:00 <heat> I guessed it's stupid to have an ALIGN_TEXT there because it's a single byte memcpy
17:10:00 <mjg> then handle 1 byte early instead
17:10:00 <heat> in my logic, memcpy(1) is already stupidly pessimal anyway, no real reason to pad it early
17:10:00 <mjg> so there is one fundamental tradeoff in that code, which is not 100% defensible
17:10:00 <heat> s/early/at all/
17:11:00 <mjg> you can either roll with some branches upfront and jump once to the target code
17:11:00 <mjg> or you can have a cascade if you will, like in the code above
17:11:00 <heat> yes, that's part of my "to improve" ideas
17:11:00 <heat> bionic memmove.S branches upfront
17:11:00 <mjg> so the idea behind it was that sizes 32-256 make up the majority of the calls
17:11:00 <mjg> so it makes sense to make it the fastest
17:11:00 <mjg> hence fewer branches to get there
17:11:00 <mjg> in my code you slide into it
17:12:00 <mjg> as in no jumps to start
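Roughly what the "branches upfront" shape looks like in C; mjg's cascade instead slides into the 32-256 body with no jump at all. copy_small/copy_32_256/copy_large are hypothetical stand-ins for the size-class bodies:

    #include <stddef.h>

    /* hypothetical size-class bodies */
    void *copy_small(void *d, const void *s, size_t n);
    void *copy_32_256(void *d, const void *s, size_t n);
    void *copy_large(void *d, const void *s, size_t n);

    /* the common 32..256 case is reached with two predictable branches */
    void *memcpy_dispatch(void *d, const void *s, size_t n)
    {
        if (n >= 32) {
            if (n <= 256)
                return copy_32_256(d, s, n);
            return copy_large(d, s, n);
        }
        return copy_small(d, s, n);   /* 0..31 */
    }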
17:12:00 <heat> I added a branch to 16 because I noticed in your histogram that most memcpies were 16 byte long
17:12:00 <mjg> you added enough branches to perhaps defeat that
17:12:00 <heat> did I?
17:12:00 <mjg> again, this one is *super* murky
17:12:00 <mjg> i'm gonna do another take today or tomorrow
17:12:00 <heat> hmm ok
17:13:00 <mjg> generate more datasets, from freebsd and ilnux
17:13:00 <mjg> and then measure total time to execute them with both variants
17:13:00 <mjg> lower total time wins
17:13:00 <mjg> by dataset i mean collect all sizes along with the number of times they showed up
17:13:00 <heat> yes
17:14:00 <mjg> randomize the order
17:14:00 <mjg> and we will see on a bunch of cpus
17:14:00 <mjg> no claiming perfect, but should be good enough
17:14:00 <heat> is memmove just doing this but backwards?
17:14:00 <mjg> yes
17:15:00 <mjg> i needed to implement it because 'bcopy' which was the goto way
17:15:00 <mjg> used to be memcpy
17:15:00 <mjg> and then a geezer made it into memmove
17:15:00 <mjg> and now i'm screwed
17:15:00 <heat> is there a penalty to always copying backwards?
17:16:00 <heat> having two versions of the same code that do forwards/backwards sounds depressing
17:16:00 <mjg> i don't think you can get away with that for arbitrarily stupid args
17:16:00 <heat> so I could have memcpy doing forwards and memmove doing backwards, that's my idea
17:16:00 <heat> hmmm ok
17:17:00 <mjg> anyhow i plan to sort out memset first
17:17:00 <mjg> same general issue + same idea what to do
17:18:00 <mjg> btw that 256 is lowballing it
17:18:00 <mjg> my haswell does better
17:19:00 <heat> is it? I think I tried higher and saw really mixed results
17:19:00 <mjg> there may be lullers on your uarch which make it into a problem
17:19:00 <mjg> again, fucking cpus man
17:20:00 <mjg> key though: rep movs is quite pessimal for short sizes, what you do about it is for the most part tradeoff city
17:21:00 <heat> do the same principles apply to SIMD memcpy too?
17:21:00 <zid> mjg do you say other words
17:21:00 <heat> except maybe SSE may have issues storing to unaligned addresses
17:21:00 <heat> I know AVX doesn't
17:21:00 <mjg> zid: my english is limited, i only got 'english for chronic complainers about performance' in school
17:21:00 <zid> makes sense
17:22:00 <zid> are you much more personable in polish
17:22:00 <mjg> of course, i'm a very well read person
17:22:00 <heat> peszimal
17:22:00 <mjg> heat: i don't know the realities for simd which i could 100% defend
17:22:00 <mjg> heat: i could give you a stackoverflow-quality answer
17:22:00 <heat> do it
17:23:00 <mjg> you wanna do overlapping stores as soon as you can
17:23:00 <mjg> but watch out how much you do them for one set
17:23:00 <mjg> [reality check: sse2 /sucks/ when you do it for certain sizes]
17:23:00 <mjg> i have 0 real data for avx
17:23:00 <mjg> i intentionally did not check glibc memcpy so that i can implement my own if needed
17:24:00 <mjg> but preferably i would steal one with an adequate license
17:24:00 <mjg> it was quite a bummer to find how much bionic sucks here :/
17:24:00 <heat> folly has an avx2 one
17:25:00 <mjg> yes i linked it
17:25:00 <heat> yes i know
17:25:00 <heat> just saying, it's an option
17:25:00 <mjg> the problem is apache license would need some finesing
17:25:00 <mjg> also i did not bench it myself
17:25:00 <mjg> also see the automemcpy paper
17:25:00 <mjg> for all i know i can generate an ok memcpy without handrolling any asm
17:25:00 <mjg> which would be the bestest
17:27:00 <heat> linux memcpy_orig seems quite ok
17:28:00 <heat> could use the erms bit for lengthy copies but it seems similar to what we both have
17:28:00 <mjg> oh heh, rolls with that jmp chain thing
17:29:00 <mjg> /*
17:29:00 <mjg> * We check whether memory false dependence could occur,
17:29:00 <mjg> * then jump to corresponding copy mode.
17:29:00 <mjg> */
17:29:00 <mjg> cmp %dil, %sil
17:29:00 <mjg> jl .Lcopy_backward
17:29:00 <mjg> i don't know about this bit
17:29:00 <mjg> back then i talked to a big wig at intel about memcpy
17:30:00 <mjg> he told me to do address comparison and then do rep mov forwards or backwards
17:30:00 <mjg> et voila
17:30:00 <heat> lol
17:30:00 <mjg> no seriously
17:31:00 <mjg> the fact that their own optimization manual recommends against it
17:31:00 <mjg> did not faze him
17:31:00 <heat> against what?
17:31:00 <mjg> rep for short copies
17:31:00 <heat> the optimization manual seems to hail rep movsb as the best shit ever
17:31:00 <mjg> also note "fast short rep mov" making an appearance in recent years further proves it is crap
17:32:00 <mrvn> For a memcpy <= 32 byte isn't a simple 1byte copy loop faster than branching for 16, 8, 4, 2, 1 bytes?
17:32:00 <mjg> mrvn: nope. i had various experiments 5 years ago, including 8 byte loops etc
17:32:00 <heat> oh how does fsrm bench with this shit?
17:32:00 <mjg> it was all slower than overlapping stores
17:33:00 <mrvn> mjg: so a loop copying 8 byte that runs maybe twice is better? That's at least one branch misprediction.
17:33:00 <mjg> heat: afair fsrm does not help rep *stos*, it does help rep *movs*, but it is still slower for sizes < 64 or so
17:33:00 <mrvn> mjg: same for any remaining 4 byte and again 2 byte.
17:33:00 <mjg> mrvn: as noted previously i'm about to generate a good real-world-based dataset to memcpy and memset
17:33:00 <mjg> i'll hack up the above to the test mix
17:34:00 <heat> do you think I'll get shot if I try to patch memcpy to Be Good(tm)?
17:34:00 <mrvn> mjg: yes please. How many cpus can you test?
17:34:00 <mjg> mrvn: westmere, sandy, haswell, skylake, coffy lake, ice lake
17:34:00 <mjg> and probably some amd if i can get arsed
17:34:00 <mrvn> mjg: also will you benchmark real code? Replace the memcpy in libc and see what that says.
17:35:00 <mjg> see above for the description of what i intend to do
17:35:00 <mjg> i can easily get hands on more intels but i think that's enough
17:36:00 <mjg> would also be funnyt o bench no microcode updates vs fresh
17:36:00 <mjg> but i don't know if i can be arsed to get the former
17:36:00 <mrvn> I wish there where a way to mark different entry points into a function for the compiler. Like: enter here if src is 16 byte aligned, enter here if dst is 16 byte aligned, enter here if size > 64, ...
17:36:00 <mjg> heat: where? linux? it is a touchy subject so i would not
17:37:00 <mjg> heat: looks like the L guy and Boris Something are going to sort it out in a manner good enough(tm)
17:37:00 <mrvn> Sometimes I miss templates + constraints in C
17:37:00 <mjg> heat: for example i'm not gonna ship my memset over there :]
17:38:00 <heat> why is it a touchy subject?
17:38:00 <mjg> read the thread
17:38:00 <heat> you probably mean borislav petkov btw
17:38:00 <heat> mkay
17:39:00 <mjg> yea
17:39:00 <mjg> also i guarantee there is something bad i don't even know about, which does affect the routine as coded by me
17:39:00 <mjg> and which some greybeard will point out as PESSIMAL
17:40:00 <mjg> while i welcome that, that's not the setting where i do
17:40:00 <mjg> :p
17:40:00 <heat> btw linux memmove is probably superior to memcpy ATM
17:40:00 <mjg> look i'm done chatting about bs, time to do some data collection
17:42:00 <heat> "And more would be dangerous because both Intel and AMD have errata with rep movsq > 4GB" haha
17:44:00 <mrvn> WTF? I have to rep movsq in blocks of 4GB? hehehehe
17:47:00 <mrvn> That's like DBcc on m68k only using the lower 16-bit of the counter register.
17:48:00 <mrvn> Can't do a full 32bit ripple carry addition, comparison and branch in the wanted cycle time
17:57:00 <mjg> now should i code the program in RUST
17:58:00 <mjg> MOTHERF^Wi don't think
18:01:00 <mjg> NAME shuf - generate random permutations
18:01:00 <mjg> check this out
18:18:00 <heat> mjg, CHECK WHAT OUT
18:18:00 <heat> YOURE MAKING ME MAD
18:18:00 <mjg> OH
18:19:00 <mjg> STFU
18:19:00 <mjg> i'm saying there is a ready-to-use tool to randomize the numbers
18:22:00 <zid> I can generate permutations in O(n) in both time and memory, in 3 lines of code
18:23:00 <zid> good enough for me
18:26:00 <zid> (LFSR with a cycle length the same as the sequence length can do O(1))
18:26:00 <heat> mjg, is there any benefit in interleaving loads and stores?
18:26:00 <zid> not on architectures that matter
18:26:00 <zid> yes on architectures that don't
18:26:00 <heat> I think they (Intel) explicitly say there is for SIMD
18:26:00 <zid> like mips, and atom
18:26:00 <zid> and avx512
18:27:00 <mjg> heat: for simd i don't know
18:47:00 <zzo38> Do you have any suggestions about specific changes to my designs, or if anything about it is unclear, etc?
18:56:00 <zzo38> Perhaps one thing I did not mention about the file records, is that the records do not all have to be the same size, and the record numbers do not have to be contiguous (it is likely that many record numbers will be skipped, since that file does not use them)
18:58:00 <zzo38> Does the design of capabilities makes sense, or do you suggest changes?
19:00:00 <zzo38> (It seems to be a problem of other operating systems, that do not properly support making locks and transactions that have several resources grouped at once; they usually only can do them separately.)
19:22:00 <heat> mjg, actually im wondering now if any of the ALIGN_TEXTs matter for small-ass sizes
19:22:00 <heat> at that point you're already doing something very pessimal, have gone through several branches, just for a 1-8 byte copy
19:23:00 <mjg> they do matter a tad bit
19:23:00 <heat> so wouldn't it be better to save on icache?
19:23:00 <mjg> once the target is far enough from the jump instruction you suffer from it not being aligned
19:23:00 <heat> yes but icache footprint
19:24:00 <mjg> they are most likely useless/harmful if you roll with a "switch" upfront
19:24:00 <mjg> it is a tradeoff, see once more the reasoning for sizes 32-256
19:24:00 <heat> yes
19:25:00 <heat> also I think bionic memmove does test fuckery instead of cmp
19:25:00 <heat> maybe worth a shot
19:25:00 <mjg> i checked in agner fog's tables
19:25:00 <mjg> it's literally the same shit
19:25:00 <mjg> in the cpu
19:25:00 <heat> really?
19:25:00 <mjg> yea
19:25:00 <heat> wtf
19:26:00 <mjg> i mean ports used and whatnot
19:26:00 <heat> yes
19:26:00 <mjg> cycle cost
19:26:00 <mjg> basically no diff that i could bench
19:26:00 <mjg> and see above why
19:26:00 <heat> i would expect an AND operation to be a good bit better than cmp
19:26:00 <heat> guess not
19:26:00 <mjg> i think what actually costs is the fucking branch mate
19:26:00 <mjg> als note instruction fusing
19:27:00 <heat> yep
19:27:00 <mjg> that said, it may be there is a funny corner case
19:27:00 <mjg> absent good reason to *not* follow bionic n this one, i would argue you *should* do it
19:28:00 <mjg> 'looks the same so we gonna go the other way' is what i gave people shit for in the past
19:28:00 <heat> well yes but otoh that memmove isn't all that great AND it was written in 2014
19:28:00 <heat> almost 10 years ago
19:28:00 <mjg> is not this where your cpu is from
19:28:00 <mjg> :XX
19:28:00 <heat> no
19:29:00 <mjg> jinkies
19:29:00 <heat> kabylake is 2016, kabylake R is 2017
19:29:00 <mjg> look at mr modern man ova here
19:30:00 <heat> i bet you're using haswell
19:30:00 <mjg> i really should have added more comments
19:30:00 <mjg> to all that stuff
19:30:00 <mjg> i just rediscovered why 'weird bit' is actually good
19:31:00 <heat> what weird bit?
19:31:00 <mjg> in memset 32 or more i do
19:31:00 <mjg> cmpb $16,%dl
19:31:00 <mjg> ja 201632f
19:31:00 <mjg> movq %r10,-16(%rdi,%rdx)
19:31:00 <mjg> movq %r10,-8(%rdi,%rdx)
19:31:00 <mjg> as in the tail bigger than 16 is handled separately
19:31:00 <mjg> turns out overlapping 16 bytes when it can be avoided is tolerable
19:31:00 <mjg> 32 is a major bummer
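A loose C rendering of that memset tail, assuming the 32-byte main loop already ran (so len >= 32, and the end-relative stores never reach before the buffer); v is the splatted fill value (see the splat8 sketch earlier), and memset_tail is an illustrative name:

    #include <stdint.h>
    #include <string.h>

    static void memset_tail(unsigned char *dst, size_t len, uint64_t v)
    {
        size_t tail = len & 31;

        if (tail > 16) {
            /* the separate "201632f" path: two extra stores for tails 17..31 */
            memcpy(dst + len - 32, &v, 8);
            memcpy(dst + len - 24, &v, 8);
        }
        /* always finish the last 16 bytes; overlapping already-set bytes
         * is harmless since everything is the same fill value */
        memcpy(dst + len - 16, &v, 8);
        memcpy(dst + len - 8, &v, 8);
    }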
19:36:00 <moon-child> heat: all basic arithmetic is single cycle for a long time now
19:40:00 <heat> why do you cmp on the actual 8/16-bit reg
19:40:00 <heat> is there any advantage in doing that
19:40:00 <mjg> it is smaller code
19:40:00 <mjg> mr ifunc
19:40:00 <mjg> erm icache
19:40:00 <heat> ifunc, icache, icrap
19:40:00 <mjg> iPhone
19:40:00 <mjg> irepstos
19:40:00 <heat> yes smaller code and then you blow it out the water with a nice 10-byte nop or whatever
19:41:00 <mjg> but i can fit more in there if needed
19:41:00 <mjg> mofer
19:41:00 <moon-child> won't somebody please think of the bytes!
19:42:00 <mjg> aight, got a db of 520684993 real-world calls to memset
19:42:00 <heat> export it to SQL and query away
19:42:00 <mjg> i'm gonna do it on the cloud mate
19:43:00 <heat> oracle database moment?
19:46:00 <heat> I feel dirty using r8d and r8w
19:47:00 <nikolar> nah it's fine
19:47:00 <heat> it's not fine
19:47:00 <heat> 1) needs an extra prefix
19:47:00 <heat> 2) clunky naming
19:51:00 <mjg> you don't need these regs
19:51:00 <mjg> i only used them so that i can safely embedd into copyin/copyout
19:51:00 <mjg> which already use some regs
19:51:00 <mjg> and i don't want to save/restore
19:54:00 <heat> i do need them
19:54:00 <heat> rdi, rsi, rdx are used by the args, rax is primed with the return value
19:54:00 <heat> so that leaves me with rcx, r8, r9, r10, r11
19:54:00 <mjg> see bionic
19:58:00 <heat> bionic saves rbx
19:58:00 <mjg> wut
19:59:00 <heat> yes
19:59:00 <heat> although they do have a funny trick here where they reuse rsi for the last load when doing the tail copying
20:04:00 <mjg> just be happy this is not ia32
20:04:00 <mjg> famine register state
20:04:00 <moon-child> ia64
20:04:00 <moon-child> 128 registers
20:04:00 <mjg> bring back itanium!!
20:04:00 <moon-child> everything else is trash by comparison
20:04:00 <heat> YESSIR
20:04:00 <mjg> now i'm curious how a memset runs there
20:04:00 <mjg> i mean looks like
20:05:00 <heat> https://elixir.bootlin.com/glibc/latest/source/sysdeps/ia64/memcpy.S
20:05:00 <bslsk05> ​elixir.bootlin.com: memcpy.S - sysdeps/ia64/memcpy.S - Glibc source code (glibc-2.37.9000) - Bootlin
20:06:00 <heat> it looks stunning
20:06:00 <heat> as in "i'm stunned wtf is going on"
20:06:00 <moon-child> 'memcpy assumes little endian mode' wat
20:06:00 <moon-child> why doesn't it matter? Don't loads have the same endianness as stores either way?
20:06:00 <heat> KEEP HATING moon-child
20:07:00 <moon-child> lol
20:07:00 <mjg> haters gonna hate
20:07:00 <mjg> fuck you moon-child
20:07:00 <mjg> !!!
20:09:00 <heat> shut up mjg cpu architecture fascist
20:09:00 <heat> mjg? more like bitchjg
20:09:00 <mjg> E10k or bust motherfucker
20:10:00 <mjg> https://www.youtube.com/watch?v=OSprsQTsy7c
20:10:00 <bslsk05> ​www.youtube.com: Sun Enterprise 10000 - YouTube
20:12:00 <heat> why does bionic memcpy also handle memmove?
20:13:00 <heat> is this mildly concerning?
20:13:00 <mjg> it used to be that glibc did it
20:13:00 <mjg> and trying to not do it resulted in buggz
20:13:00 <heat> is this an actual compatibility concern?
20:13:00 <mjg> depends, i don't know if glibc is doing it today
20:13:00 <mjg> people claim it is not
20:14:00 <heat> generic memcpy isn't I think
20:14:00 <heat> so...
20:14:00 <zid> because you can't trust people who'd use bionic
20:14:00 <mjg> that funky memcpy does
20:14:00 <zid> to stick to the actual semantics of memcpy
20:14:00 <moon-child> I would rather check for overlap and fault if so
20:14:00 <heat> their generic memcpy also supports page moving for GNU hurd
20:14:00 <moon-child> fix yo shit
20:15:00 <zid> I'm actually really lazy about using memcpy instead of memmove >_<
20:15:00 <mrvn> memcpy() should check and assert so bad code gets fixed.
20:15:00 <mjg> fucking
20:15:00 <heat> fucking. - mjg
20:15:00 <mjg> i wrote that toy prog i mentioned, very wip
20:16:00 <mjg> runtimes vary wildly
20:16:00 <mjg> turns out the total time is so long it gets preempted
20:16:00 <mjg> :d
20:16:00 <heat> toy prog for what?
20:16:00 <mjg> heheszek-read-bionic time 8920719533
20:16:00 <mjg> heheszek-read-erms time 10307939142
20:16:00 <mjg> heheszek-read-current time 8229679417
20:16:00 <mjg> heheszek-read-current time 10446866317
20:16:00 <mjg> heheszek-read-current time 6845942134
20:16:00 <mjg> running the 50 mln memsets
20:16:00 <geist> yah i think most libcs i've looked at simply have memcpy and memmove be the same symbol
20:16:00 <mrvn> mjg: pin the test to the core and pin everything else away from it
20:16:00 <mjg> mrvn: i already did
20:16:00 <geist> is it silly? yeah, but then really having two separate symbols is
20:17:00 <mjg> i may need to boot on linux and isolate cpus
20:17:00 <mrvn> mjg: then linux tickless implementation sucks.
20:17:00 <geist> it's like sprintf or gets, they're bad ideas from an older era
20:17:00 <mjg> mrvn: that's on freebsd :)
20:17:00 <mjg> mrvn: will do it on linux
20:17:00 <mrvn> mjg: did you remember to pin the IRQs too?
20:17:00 <mjg> i can't do that on that sucker
20:17:00 <mjg> again, will do it right on linux, but bummer i have to resort to it
20:17:00 <heat> geist, i think separate memcpy still makes sense. you optimize out a branch
20:18:00 <geist> but you might break code that misuses it
20:18:00 <heat> just like having separate memcpy_aligned_N or memcpy_nt makes sense
20:18:00 <geist> also means you need to write two implementations
20:18:00 <zid> It makes very sense for specifically a language like C to make both available as builtins
20:19:00 <zid> why use C if you don't want optimizations like that to happen and break your code, use rust :P
20:19:00 <heat> ruuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu
20:19:00 <geist> ruuuuuuuuuuuust
20:19:00 <geist> though rust will probably just call through to memcpy to be fair
20:19:00 <geist> but it should know if things overlap
20:19:00 <heat> that's right, if you don't want memcpy, use go
20:20:00 * heat watches as they reimplement memcpy
20:20:00 <moon-child> you can't know if stuff overlaps
20:20:00 <moon-child> in general
20:21:00 <moon-child> cus you can compute arbitrary subscripts into an array
20:21:00 <mjg> everyone is a smartass until they need to code a website
20:21:00 <geist> rust can probably via a series of rules know that any two objects can't overlap
20:21:00 <geist> it's only situations where you're moving stuff within the same array
20:21:00 <mrvn> moon-child: but then the compiler knows it's the same array
20:21:00 <moon-child> sometimes you're in the same array and you know your subscript never overlap
20:21:00 <geist> rust in particular would know this. C/C++ wouldn't
20:21:00 <moon-child> I'm not saying it doesn't sometimes know, I'm saying it doesn't always know
20:21:00 <moon-child> c can restrict
20:22:00 <mrvn> moon-child: If it can't know if ranges of objects overlap then it can call memmove
20:22:00 <geist> does make me think, i do remember there was a fairly clever sequence of instructions to detect overlap, and how much
20:23:00 <mjg> i once more note some of the intel claim you need to check relative addresses for performance reasons anyway
20:23:00 <geist> and also interestingly, depending on how two things overlap and by how much and how long your copy stride is you can still, a lot of the time, use your core algorithm
20:23:00 <mjg> and if you do that, the issue goes away
20:23:00 <heat> why is no one on the page loaning koolaid
20:23:00 <geist> that'd have to be fast as fuck to beat a copy
20:23:00 <mrvn> if (abs(dst - src) > 256) memcpy_I_dont_care()
20:24:00 <moon-child> geist, first thing that comes to mind is sign bit of (startx - starty) xor (startx + len - starty)
20:24:00 <geist> moon-child: something like that
20:24:00 <geist> mrvn: hmm, but does that work with both directions of overlap?
20:24:00 <mrvn> geist: no. sometimes you have to copy backwards.
20:24:00 <geist> if your algorithm copies forward then one of the overlaps will be problematic, i think
20:25:00 <mrvn> geist: if (abs(dst - src) > len) memcpy_its_safe()
20:25:00 <moon-child> even if they overlap the 'wrong' way, as long as it's less than your buffer size, it's fine
20:26:00 <mrvn> moon-child: no. some overlaps you have to copy backwards.
20:26:00 <geist> moon-child: only if you copy in the right direction, because otherwise you'll start overwriting source data before you get to it
20:26:00 <moon-child> err no
20:26:00 <moon-child> ignore me
20:26:00 <mrvn> You can ignore backwards copying when len < buffer_size
20:26:00 <geist> but yeah you *should* generally be able to write a reverse copy version of your algorithm at pretty much the same speed
20:26:00 <geist> so now you have to write two versions of the whole thing, forward and backwards
20:27:00 <geist> then handle the overlap case where they're too close: which is probably just revert to a bytewise forward or back copy
20:27:00 <geist> then that's basically the guts of memmove
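A minimal sketch in C of the memmove guts described above: pick the copy direction from the overlap, with byte loops standing in for the tuned forward/backward inner loops. The single unsigned-subtraction test is one way to fold mrvn's abs() check and moon-child's sign-bit idea into one comparison.

#include <stddef.h>
#include <stdint.h>

void *my_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* wraps to a huge value when d < s, so this reads as "dst is below
       src, or dst starts at least n bytes past src": forward is safe */
    if ((uintptr_t)d - (uintptr_t)s >= n) {
        while (n--)
            *d++ = *s++;
    } else {
        /* dst overlaps the tail of src: copy backwards */
        d += n;
        s += n;
        while (n--)
            *--d = *--s;
    }
    return dst;
}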
20:27:00 <mrvn> Copying backwards is bad for the performance though. So you probably want to copy smaller chunks forward if the overlap is larger than your buffer size.
20:27:00 <geist> not really. i think most decent architectures detect reverse stride just as well
20:27:00 <moon-child> why would backwards be bad?
20:27:00 <mrvn> e.g. on an overlap with 64k offset
20:28:00 <moon-child> yeah prefetcher can detect backwards accesses
20:28:00 <mrvn> moon-child: because RAM chips are made to read incrementing sequentially.
20:28:00 <geist> oh modern stuff is so abstracted from what ram chips are doing that's immaterial
20:28:00 <mrvn> ok
20:28:00 <geist> but sure you can still do say 64 byte chunks forward as you step backwards
20:28:00 <mrvn> that info is maybe 12 decades old.
20:29:00 <geist> i mean the ram chip probably does have an open next row thing (depending on if it's row or column first) but then the rows are large, and i think you can easily address backwards in it
20:29:00 <mrvn> geist: would it matter for 64byte? That's a cache line. I don't think it matters in what direction you access a cache line.
20:29:00 <geist> right
20:30:00 <geist> hence why that wouldn't really matter if it did it forward or backwards within the cache line, probably.
20:30:00 <mrvn> geist: I was more thinking about 1k or so. Where the prefetcher would fail to get the next cache line ahead of time.
20:31:00 * geist nods
20:31:00 <moon-child> there is sub-cacheline structure. But I don't think order matters
20:31:00 <geist> now if it's just that level of arch where it can't really detect prefetching stride, then yeah you might see worse performance
20:31:00 <mrvn> And then as you say the next size is the row of the DIMMs. Every time you have to load a new row address you lose time.
20:32:00 <geist> but we're already talking about the sub case where things overlap in some way. if A is less than B or vice versa with no overlap there's no reason to do a backwards copy
20:32:00 <moon-child> (I think usually, sub-cacheline, is organised into groups of 8 bytes. A misaligned 8-byte load can grab from two 8-byte groups, doesn't matter if they're in the same line, hence misaligned 8-byte loads may be fast even when they cross a cache line boundary, contra wider loads)
20:32:00 <geist> honestly you could probably just revert to bytewise for overlap cases and probably wouldn't be a big deal
20:32:00 <geist> i mean it would be sub optimal but that's sort of a TODO case
20:33:00 <mrvn> checking for all the overlap cases though is a couple of branches. So for memcpy() it's worth it not to need those.
20:33:00 <geist> lots of overlap is usually folks moving strings around, and they're probably fairly small *or* it's something incredibly stupid
20:33:00 <geist> again depends on if you trust all users of memcpy to do the right thing. i'm not sure if that's a wise idea because of the existing implementations that just union the two things
20:34:00 <geist> and sure that's broken code, etc etc, but that's also the life of an OS hacker
20:34:00 <geist> dealing with dumbass users
20:35:00 <mrvn> geist: are you calling me dumbass? :)
20:36:00 <geist> welllll
20:36:00 <mrvn> being the only user has the pro and con that all users are as dumb as I expect them to be.
20:37:00 <geist> yah
20:38:00 <mrvn> I like having assert()s that check for overlap on memcpy though. Because I know I'm dumb. :)
20:47:00 <geist> you can certainly do it in a wrapper without messing with the core implementation
20:49:00 <mrvn> It's best placed (or at least duplicated) in the .h file so the compiler can evaluate it at compile time where possible. Only the case where everything is unknown should call the full memcpy/memmove with all the branches.
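One way to spell the header-level check mrvn suggests, as a sketch (memcpy_checked is a made-up name): with compile-time-known pointers and size, the overlap test folds away and the assert costs nothing; only the fully-unknown case pays for the branches.

#include <assert.h>
#include <stdint.h>
#include <string.h>

static inline void *memcpy_checked(void *d, const void *s, size_t n)
{
    /* asserts the ranges [d, d+n) and [s, s+n) are disjoint */
    assert((uintptr_t)d - (uintptr_t)s >= n ||
           (uintptr_t)s - (uintptr_t)d >= n);
    return memcpy(d, s, n);
}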
20:58:00 <heat> mjg, turns out 256 is indeed too conservative for erms
20:59:00 <heat> i bumped it up to 512 here
20:59:00 <mjg> that's pushin it
20:59:00 <heat> it is
20:59:00 <mrvn> mjg: with all your memcpy benchmarking can you say at what point memcpy will be slower than changing page tables?
21:00:00 <heat> also I assume erms on backwards sucks harder than a manual copy?
21:00:00 <mjg> mrvn: i'm only benchin small sizes
21:00:00 <mjg> mrvn: < 256
21:00:00 <mjg> mrvn: this is for kernel memcpy
21:01:00 <mjg> et al
21:01:00 <mjg> heat: yea
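The kind of cutoff being tuned above, sketched in C for x86-64 with GNU inline asm (the 512 is the empirical value under debate here, not a recommendation): below the threshold a plain loop stands in for the tuned small-size path, above it rep movsb amortizes its startup cost.

#include <stddef.h>

#define ERMS_CUTOFF 512  /* empirical, per the discussion above */

static void *memcpy_erms_cutoff(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    if (n >= ERMS_CUTOFF) {
        /* ERMS: microcoded fast-string copy, good for large sizes */
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :: "memory");
    } else {
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n--)  /* stand-in for the real small-size path */
            *d++ = *s++;
    }
    return ret;
}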
21:01:00 <mrvn> that's exactly where I would move pages around too
21:01:00 <geist> i dunno unless you have an extremely specialized situation, moving around pages is very expensive
21:01:00 <geist> you'd have to be local cpu, involve no cross cpu TLB shootdowns, etc
21:02:00 <mjg> i once more refer you to https://people.freebsd.org/~mjg/bufsizes.txt
21:02:00 <heat> moving pages around's cost probably scales with the number of CPUs involved
21:02:00 <mjg> see memmove_erms et al
21:02:00 <mjg> tons and tons of ops are super small
21:02:00 <mrvn> geist: no threads, no shared memory, so no cross cpu worries there.
21:02:00 <zid> what's an erms
21:02:00 <geist> no matter what you do you should probably optimize a page sized memset and page sized copy
21:03:00 <geist> hypothetically that should be most of what your kernel does
21:03:00 <zid> I feel it isn't efficient reverse memory sausage
21:03:00 <mjg> i have to go afk now, can respond later to whatever
21:03:00 <mrvn> geist: it is, so far.
21:03:00 <geist> mrvn: sure. ie, the extremely specialized situation
21:03:00 <heat> zid, Enhanced REP MOVSB
21:04:00 <geist> in more general purpose things, it's hard to justify fiddling with page tables at this level
21:04:00 <zid> oh right
21:04:00 <mrvn> still, would be nice to know at what size a local page remap, global page remap, cross cpu shootdown, ... becomes cheaper than copying
21:04:00 <geist> i think it made sense farther back in time, but it's one of these cases where modern cpus are faster copying things than the overhead of fiddling with the mmu. plus all the cross cpu shootdowns
21:05:00 <mrvn> isn't that reversing with the cross cpu shootdown mechanics that don't use IPIs?
21:06:00 <geist> like in ARM64? possibly, but that's also not free
21:06:00 <heat> they are NOT free
21:06:00 <geist> for example you'd probably have to to break-before-make
21:06:00 <heat> and also O(n) with the number of CPUs involved
21:06:00 <geist> since you're replacing one page with another
21:06:00 <mrvn> Not free. Cross cpu shootdowns just have become so expensive that they are getting optimized now.
21:06:00 <geist> the ARM TLB shootdown stuff is nice because it doesn't have to do a full IPI, but there's still cost
21:07:00 <geist> the new intel TLB shootdown proposal effectively bumps the IPI up to some sort of pseudo microcode/SMM mode thing
21:07:00 <geist> so hypothetically that helps a bit, since it's not a full interrupt
21:08:00 <mrvn> I just want to invalidate the other cores TLB cache entry.
21:08:00 <geist> and of course AMD has their solution too
21:08:00 <heat> why are x86 vendors so stubborn?
21:08:00 <heat> screw them both
21:08:00 <geist> yay!
21:08:00 <geist> go Via! be the dark horse third one
21:08:00 <heat> i want a bootleg soviet 386
21:10:00 <geist> anyway re: TLB shootdown on ARM. it *has* to be more efficient than an explicit IPI like you get on x86 or riscv, but i honestly dont know the numbers
21:10:00 <geist> it's one of these things where you really just dont have any other choice, since there's functionally no other way to do it
21:10:00 <geist> so hypothetically the tlb shootdown is 'free' but probably in reality it's highly based on the microarchitecture, how many cpus are active, what they're doing at the time, etc
21:10:00 <geist> i've never actually measured it to be honest
21:12:00 <mrvn> geist: it should really be the same cost as a local TLB shootdown except you send it to the other cache. The only costly thing would be when 2 cores try to do it at the same time. Then one has to somehow wait etc...
21:13:00 <mrvn> out of curiosity: Does ARM support multiple sockets?
21:13:00 <geist> yeah
21:13:00 <dh`> there must be multiple-socket arm64 boards by now
21:13:00 <geist> yep. i have one right here
21:15:00 <dh`> the thing I have always wondered about hardware-assisted tlb shootdown is: tlb shootdown has to ultimately be synchronous (that is, you have to wait for it to complete before you continue with whatever VM ops you're doing) but suspending the current processor entirely for that time seems like a bad plan
21:15:00 <geist> the mechanism by which the TLB invalidates gets broadcast is not documented, but presumably it's somewhat like a cache eviction thing
21:15:00 <zid> geist loves his athlon 64 x2 x2
21:15:00 <geist> dh`: the mechanism ARM (and AMD) do is you start the eviction with an instruction, and then later on there's another instruction to sync
21:15:00 <geist> DSB on arm, TLBSYNC on AMD
21:15:00 <mrvn> dh`: it's just a pipeline flush
21:15:00 <geist> so that lets you at least do some other work in the middle
21:16:00 <dh`> sounds like not a full switch to another thread though
21:16:00 <geist> so it does mitigate that somewhat. you can schedule multiple flushes in your code and then synchronize on the way out
21:16:00 <dh`> but maybe the latencies aren't high enough for that to matter
21:16:00 <heat> no, you don't switch
21:17:00 <heat> the point is that you can keep batching tlb shootdowns as you go and TLBI'ing them
21:17:00 <geist> what is annoying about swapping pages vs just general unmap/etc is you have to do break-before-make, which is somewhat more synchronous
21:17:00 <heat> then at the end you tlbsync/dsb
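The batch-then-sync pattern geist and heat describe, sketched for AArch64 (assumes EL1 page tables, inner-shareable broadcast TLBI, 4K pages; tlbi_page, tlb_sync, and unmap_range are invented names):

#include <stdint.h>

static inline void tlbi_page(unsigned long va)
{
    /* queue a broadcast invalidate for one VA; does not wait */
    __asm__ volatile("tlbi vaae1is, %0" :: "r"(va >> 12) : "memory");
}

static inline void tlb_sync(void)
{
    /* wait until all previously issued TLBIs have completed everywhere */
    __asm__ volatile("dsb ish; isb" ::: "memory");
}

void unmap_range(unsigned long va, unsigned long npages)
{
    /* batch the invalidates as you go, synchronize once on the way out */
    for (unsigned long i = 0; i < npages; i++)
        tlbi_page(va + i * 4096);
    tlb_sync();
}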
21:17:00 <geist> ie, you have to right then and there shoot down the TLB and wait for it to complete before you can put the new entry in
21:17:00 <dh`> with an IPI it's quite feasible to run another thread
21:17:00 <mrvn> dh`: you only need to sync before the next instruction that could access the evicted address.
21:17:00 <dh`> since even without latency from interrupts being off on some of the other cpus, the interrupt and interrupt dispatch takes considerably longer than two thread switches
21:18:00 <mrvn> geist: why? I would say the opposite, swapping pages is make-before-break
21:18:00 <heat> i mean, you could schedule out I think?
21:18:00 <geist> nope. not on ARM. it's complicated
21:18:00 <geist> it's all about avoiding the situation of having conflicting TLB entries on multiple cpus at the same time
21:18:00 <mrvn> geist: you write the new address into the page table, you invalidate the TLB.
21:18:00 <geist> certain subsets of TLB shootdowns involve break-before-make
21:19:00 <geist> nope. that's not how it works mrvn
21:19:00 <geist> there's a whole treatise in the ARM manual about why you need break-before-make, and which precise situations you need it
21:19:00 <mrvn> geist: doesn't that guarantee that after sync all cores will have the same entry (or none)?
21:19:00 <geist> it guarantees that *at all points in the sync* they have the same entry
21:19:00 <geist> which is mandatory for Reasons
21:19:00 <mrvn> urgs.
21:19:00 <geist> ie you remove the old entry, TLB sync, then add the new entry
21:20:00 <geist> ie, break-before-make
21:20:00 <mrvn> I would have thought you just ignore the big race hole between make and break. It's UB anyway.
21:20:00 <geist> so only subsets of page table modifies hit this case, but changing what is mapped at a particular slot to something else is one of them
21:20:00 <geist> unmapping or mapping doesn't cause it, of course
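Break-before-make as geist lays it out, as a sketch (reusing the hypothetical tlbi_page()/tlb_sync() helpers from the sketch above, and assuming a simple 64-bit pte): the old entry is removed and globally invalidated before the new one becomes visible, so no two CPUs ever hold conflicting TLB entries for the same VA.

#include <stdint.h>

void remap_page_bbm(volatile uint64_t *pte, uint64_t new_pte, unsigned long va)
{
    *pte = 0;        /* break: old translation removed from the tables */
    tlbi_page(va);   /* broadcast invalidate for this VA */
    tlb_sync();      /* wait: no CPU can still hold the old entry */
    *pte = new_pte;  /* make: install the replacement */
}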
21:21:00 <mrvn> geist: does the reason involve the A/D bits?
21:21:00 <geist> yes
21:21:00 <dh`> I vaguely remember having this discussion once before here
21:21:00 <mrvn> ahh, not using (or having even) those.
21:21:00 <geist> everything to do with other cores having weak memory model writeback to A/D bits and having out of sync TLB entries
21:21:00 <geist> mrvn: congratulations
21:22:00 <mrvn> geist: the A/D bits can get triggered between make and break and then you indeed have problems.
21:22:00 <geist> yep, that's the main issue, and there's some other subtle reason
21:22:00 <geist> i encourage you to read the manual on the topic before you fiddle too much, so at least you know if you're playing with fire or not
21:22:00 <mrvn> Are A/D hardware bits mandatory on AArch64 or still optional?
21:23:00 <geist> >=v8.1
21:23:00 <dh`> you can get inconsistent reads if it's a page shared with another process
21:23:00 <geist> >=v8.1
21:24:00 <mrvn> geist: I have no shared memory, no mmap, no page getting remapped actually. I (so far) only have map, unmap and move between 2 page tables.
21:24:00 <geist> congratulations
21:24:00 <mrvn> geist: I'm sticking with a verry simplistic memory model for reasons. :)
21:24:00 <geist> i understand you have a very simple system that bypasses most of these concerns, but it really doesn't matter to me (or probably anyone else)
21:24:00 <geist> it's not useful to recommend things to other people based on your personal needs
21:24:00 <mrvn> just saying why I haven't run into this issue
21:25:00 <geist> sure
21:25:00 <geist> but again you should probably read the manual on the topic. i think there's some other reason why break-before-make may be necessary
21:26:00 <dh`> suppose it's a copy-on-write remap; process 1 thread 1 reads 6, goes to write 9, updates the mapping, starts shootdown, process 2 writes 7, process 1 thread 2 reads the 7, then the shootdown completes
21:26:00 <mrvn> dh`: if you have a shared page and one thread/process remaps the page without some form of synchronization then you have a race condition already.
21:26:00 <geist> there's some trickery when detaching page tables that involve a BBM style thing
21:26:00 <heat> gang, i need assembly help
21:26:00 <heat> jz .Lout
21:26:00 <heat> sub $32, %rdx
21:26:00 <heat> cmp $32, %rdx
21:26:00 <heat> jae .L32b_backwards
21:26:00 <geist> to keep other cpus from having a page table cache entry floating around before the page is reused
21:26:00 <heat> why is the cmp not redundant?
21:26:00 <heat> WAIT
21:26:00 <heat> im stupid
21:26:00 <dh`> when rdx starts out at 65 :-)
21:26:00 <geist> yay duck debugging
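For the record, the cmp isn't redundant because "sub $32, %rdx" only sets flags for (old count - 32), i.e. whether the count was >= 32 *before* subtracting, while the branch needs the *remaining* count to be >= 32; the two disagree for counts in 32..63. In C terms (a sketch; assumes earlier code guarantees the count is 0 or >= 32 here, the asm is in comments):

#include <stddef.h>

void tail_dispatch(size_t len)
{
    size_t n = len;  /* %rdx */
    if (n == 0)      /* jz .Lout (flags from an earlier instruction) */
        goto out;
    n -= 32;         /* sub $32, %rdx */
    if (n >= 32)     /* cmp $32, %rdx; jae .L32b_backwards */
        goto copy_32b_backwards;
    return;          /* ... handle the < 32 remainder ... */
copy_32b_backwards:
    return;          /* ... 32-byte backwards loop ... */
out:
    ;
}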
21:28:00 <mrvn> dh`: In your example both processes would trigger a page fault. The original entry is read-only.
21:28:00 <dh`> no? p1 t1 was already doing the page fault, p1 t2 was only reading, p2 might not have it readonly
21:29:00 <mrvn> dh`: process 2 writes 7 ==> page fault
21:29:00 <heat> geist, what does zircon use to handle copyin/out page faults?
21:29:00 <heat> fixup table?
21:29:00 <dh`> p2 might not have it readonly
21:29:00 <mrvn> dh`: It's COW, it must be read-only.
21:29:00 <dh`> says who? see MAP_PRIVATE
21:29:00 <geist> fixup table i think, depending on precisely what you mean by fixup table
21:29:00 <mrvn> dh`: that's how COW works. Both sides of the COW get a read-only entry.
21:29:00 <dh`> I mean, arguably MAP_PRIVATE is a bug, but it's the agreed-upon default behavior
21:29:00 <heat> struct fixup_entry {u64 ip; u64 fixup_ip;} table[];
21:30:00 <mrvn> dh`: the page only becomes read-write when you resolve the COW and then the page is no longer shared.
21:30:00 <dh`> mrvn: I repeat, MAP_PRIVATE
21:31:00 <geist> dh`: i think the key here is at some point after the page table entry is updated either the other process still sees the RO version (and page faults) or the RW, but it's okay for that to be slippery
21:31:00 <mrvn> dh`: MAP_PRIVATE has nothing to do with that.
21:31:00 <heat> like, erm, I want to plug this into copy_to/from_user but adding fixup entries for every single memory access sucks
21:31:00 <dh`> sure it does
21:31:00 <geist> worst case the second cpu page faults, then goes in to discover the page is already RW and shrugs and retries
21:31:00 <dh`> MAP_PRIVATE allows other processes' changes to show through to you until you trigger your own copy
21:31:00 <geist> heat: ah no we just do it as a register/deregister thing
21:31:00 <dh`> so some other process could be making such a change
21:31:00 <mrvn> dh`: when you MAP_PRIVATE you get a COW read-only mapping of everything. When you fault on a write you get an unshared copy.
21:31:00 <heat> ah, like BSD then
21:31:00 <geist> ie, the start of the copyin/copyout code sets a recovery pointer in the current thread structure
21:31:00 <heat> yep
21:32:00 <geist> i think that works reasonably well
21:32:00 <geist> it's not balls to the wall optimally fast, but i think it's a reasonble compromise
21:32:00 <geist> having the pre-canned table seems like a microoptimization
21:32:00 <dh`> geist: defining magic ranges for the trap handler to treat specially shifts a couple instructions out of the common fast path
21:32:00 <geist> yep
21:32:00 <mrvn> dh`: process 2 doing a MAP_PRIVATE won't allow it to write to pages so process 1 sees it
21:32:00 <dh`> yeah, micro-optimization
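The register/deregister scheme geist describes, sketched in C (every name here is invented, and the fault-handler side is elided): the copy routine arms one per-thread recovery address instead of a fixup entry per instruction, and the page fault handler redirects the faulting frame there on an unrecoverable user access. Uses the GNU C label-address extension for brevity.

#include <errno.h>
#include <stddef.h>
#include <string.h>

struct thread { void *fault_recovery; };
struct thread *current_thread(void);  /* assumed helper */

int copy_from_user(void *dst, const void *usrc, size_t n)
{
    struct thread *t = current_thread();

    t->fault_recovery = &&fault;  /* arm: fault handler jumps here */
    memcpy(dst, usrc, n);         /* may fault on the user pages */
    t->fault_recovery = NULL;     /* disarm on the common fast path */
    return 0;

fault:
    t->fault_recovery = NULL;
    return -EFAULT;
}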
21:32:00 <dh`> no, mrvn
21:33:00 <dh`> process 2 is doing something else, maybe via MAP_SHARED, maybe just write(2)
21:33:00 <dh`> *you* have the MAP_PRIVATE and you're in the middle of copying
21:33:00 <mrvn> dh`: ahh, that is a horse of another color
21:33:00 <dh`> not really
21:34:00 <mrvn> mixing write and mmap without syncronization is a race condition. So yeah, you might get 7 or 9. It's a race.
21:34:00 <dh`> actually the example I provided is invalid but you can construct other more complicated ones that involve different addresses on the same page
21:35:00 <dh`> the point is that you can get traces where thread 1 sees that the copy happened before process 2 and thread 2 sees that it happened afterward
21:36:00 <dh`> and if you then do this on two separate pages you can get an observable inconsistency
21:36:00 <mrvn> dh`: sure. hence the need to synchronize. One way to do that is to first break the mapping like geist says you need to do anyway. Then all processes run into a fault and that blocks till you're finished remapping everything.
21:37:00 <dh`> right
21:37:00 <dh`> that was the point, ultimately
21:38:00 <dh`> you can't engage the new mapping until the old mapping is revoked everywhere
21:38:00 <dh`> that's also what I was blathering about when I was talking about waiting for completion
21:38:00 <mrvn> dh`: if it involves user processes I would still think all those cases are actually UB or race conditions. The kernel doing COW is a separate matter and the kernel needs to synchronize that.
21:38:00 <dh`> there's no UB at the machine level
21:39:00 <heat> ARM does have some UB
21:39:00 <dh`> and also, things like mprotect changes that can be triggered from userlevel come with implicit atomicity guarantees
21:39:00 <heat> in the IMPLEMENTATION DEFINED stuff or whatever they call it
21:39:00 <dh`> typically processors only allow unprivileged execution to be UNPREDICTABLE rather than UNDEFINED because the latter is bad for security
21:40:00 <mrvn> dh`: if one thread calls mprotect to make a page read-only and another thread writes to it then it's undefined whether the other thread segfaults or not. Depends on the exact timing. That kind of stuff.
21:40:00 <dh`> no, but it must happen either before or after
21:41:00 <dh`> and if you let that become fuzzy it becomes possible to construct ways to observe the inconsistency
21:41:00 <mrvn> even that might not be the case with compiler and cpu reordering stuff
21:41:00 <dh`> at least in unix it's a basic guarantee of the syscall api
21:42:00 <mrvn> A lot of that stuff predated threads :)
21:42:00 <dh`> and a lot of stupid stuff had to be sorted out when it became possible to have multiple threads observing in a single process
21:44:00 <dh`> I think at this point posix doesn't make promises about what happens if you mprotect or unmap regions that are arguments to read/write when those calls are in progress
21:45:00 <netbsduser> i am pinning my buffers nowadays
21:45:00 <mrvn> dh`: where did you see that mprotect is atomic? man 7 pthread doesn't have it in the list of thread safe functions and the manpage doesn't mention it here.
21:45:00 <netbsduser> if you do read/write it wires down the underlying pages
21:45:00 <dh`> all system calls are atomic unless explicitly stated otherwise
21:46:00 <dh`> anyway there's valid reasons for wanting mprotect to be atomic and no real excuse for fumbling it :-)
21:46:00 <mrvn> dh`: then they would all be thread safe
21:47:00 <dh`> what specifically do you mean by "thread safe"?
21:47:00 <geist> i think in general the rules are you can't observe thing in any particular order, but you at least observe old or new state, and nothing in between
21:47:00 <mrvn> never mind, found it.
21:47:00 <geist> that's really the only reasonable thing anything can guarantee
21:47:00 <mrvn> dh`: "A thread-safe function is one that can be safely (i.e., it will deliver the same results regardless of whether it is) called from multiple threads at the same time."
21:48:00 <dh`> all system calls that are actually system calls should be thread-safe in that sense
21:48:00 <dh`> that is very basic
21:48:00 <mrvn> dh`: they are "except for the following functions: ..."
21:48:00 <geist> well, that's not really true intrinsically. you generally have to jump through at least some number of hoops to guarantee that
21:48:00 <geist> like, say the file descriptor moves the cursor atomically
21:49:00 <dh`> calls that are allowed to be wrappers in libc are somewhat different
21:49:00 <geist> or, a file descriptor is either open or not at any given point
21:49:00 <mrvn> dh`: anything with static buffers is on that list
21:49:00 <dh`> there are no syscalls with static buffers
21:49:00 <dh`> it doesn't make sense
21:49:00 <mrvn> dh`: but way too many libc functions
21:49:00 <geist> what is really hard to do is guarantee that at the completion of a syscall all of its results are observed everywhere
21:49:00 <dh`> yes but that's an entirely different issue
21:49:00 <geist> easy to do up until you have multiple threads calling things simultaneously
21:50:00 <mrvn> geist: or even still valid
21:50:00 <mrvn> dh`: that was in reference to "wrappers in libc"
21:50:00 <geist> right, we ended up for zircon declaring the model is fairly loose
21:50:00 <dh`> right
21:50:00 <heat> is there no way to define descriptive function-local labels in GNU as?
21:51:00 <heat> .Lblah is not function local
21:51:00 <heat> this is exactly what mjg was talking about yesterday
21:51:00 <dh`> heat: only file-local
21:51:00 <dh`> what's a "function" in assembly anyway? not a meaningful concept
21:51:00 <heat> :^) shoot me
21:52:00 <dh`> geist: for anything that updates kernel state unlocking that state should force global visibility
21:52:00 <geist> yeah but the tough part is what does it do to syscalls that are simultaneously occurring on the same state
21:52:00 <geist> the canonical example is a syscall that frobs a handle simultaneously being called with a syscall that closes the handle
21:53:00 <dh`> right, one has to execute first
21:53:00 <geist> that's too difficult, without serializing the entire kernel
21:53:00 <dh`> that's part of what's meant by the atomicity guarantees for unix syscalls
21:53:00 <geist> the obvious 2 cases are: close happens first, frob fails. frob occurs first, close happens
21:54:00 <geist> but the 3rd and less obvious case is: frob occurs first, close happens *and* exits, frob continues to happen
21:54:00 <geist> ie you have a syscall that's still occurring on a handle that is now closed
21:54:00 <dh`> should not be, boils down to a single word-sized access of an entry in the descriptor table, even if you do everything unlocked one will go before the other
21:54:00 <geist> i'm not entirely sure posix defines that
21:54:00 <geist> the key is what happens when the frob syscall looks up the underlying object, gets a reference to it. it's now *past* the descriptor table.
21:55:00 <dh`> in principle it means the read happened before the close
21:55:00 <geist> it did, but then as a result the frob syscall goes about its business *after* the handle was closed
21:55:00 <dh`> even if all it actually means is that the read crossed the descriptor table before the close
21:55:00 <dh`> that defines the ordering
21:55:00 <geist> so you have to consider that case, or explicitly add machinery to make the close syscall wait until all outstanding operations on it have completed
21:55:00 <geist> we chose not to do the latter
21:56:00 <dh`> in order to have an inconsistency you need to then be able to observe something that shows that the close executed before the read
21:56:00 <geist> i think you're missing the point here. the point is not that the close happened before the *start* of the read. it's that the read is continiuing to happen
21:56:00 <geist> because syscalls aren't atomic in units of time
21:57:00 <dh`> right, but can you construct an observation that shows that?
21:57:00 <dh`> you can call gettimeofday() after each call but that tells you nothing
21:57:00 <geist> absolutely. it's easy. do a blocking read on something and then close it
21:57:00 <geist> i actually dunno what posix does there. does it abort any read operations on the fd?
21:57:00 <geist> (probably). but is it defined that way
21:58:00 <geist> or is that just a side effect of implementation
21:58:00 <dh`> traditionally? the close affects the table, not the file object (or vnode)
21:58:00 <geist> exactly
21:58:00 <dh`> how do you observe that the read is still in progress though?
21:59:00 <geist> the fact that the read syscall is still occurring after T0, where T0 is where the close syscall returns
21:59:00 <dh`> how do you know it's occurring, and where do you get that stamp?
21:59:00 <geist> i'm not sure i understand this line of thinking. it's easy to observe all this stuff using standard observational stuff
22:00:00 <geist> ie, a universal clock to the computer
22:01:00 <dh`> sure, you can also in principle monitor the electron flows in the cpu
22:01:00 <dh`> but the semantically important question is what a program can observe
22:01:00 <geist> also i'm trying not to be posix specific here. i think posix sidesteps a lot of this by not having a tremendous number of types of frobs you can do on handle. i also think it sidesteps a fair number of these things by being implementation specific
22:01:00 <geist> thread B is still sitting in the read syscall after thread A has completed closing the handle
22:02:00 <dh`> can you distinguish that from thread B from having returned and stalled before doing anything else?
22:02:00 <geist> and after some reasonable time does not return with ERR_CANCELLED or whatnot
22:02:00 <geist> yes yes, i know what you're trying to do here, but that's not the point
22:03:00 <dh`> it _is_ the point though because all these notions are relative to some model
22:03:00 <geist> so perhaps a better way of saying it is does thread B get its syscall cancelled as a result of thread A calling close
22:03:00 <geist> or does thread A wait until thread B exits, etc
22:03:00 <geist> and i simply dont know what posix states here, or if it states anything at all
22:03:00 <dh`> the whole point of having something more parallel than fetch one instruction at a time and execute it to completion is that there's supposed to be a model in which the execution is still consistent
22:04:00 <dh`> it's usually safe to assume that posix states either nothing or nothing helpful :-)
22:04:00 <geist> right. so my point is you have to define some sort of model ideally. even if the model is precisely undefined in some situations (ie, could be A or B but can't tell)
22:04:00 <dh`> at some point we discovered that someone had changed netbsd's close to behave in a manner other than the usual default and there was a big ruckus about it
22:04:00 <dh`> let me see if I can find that
22:05:00 <dh`> but I think the conclusion was that what they did was legal, just unexpected and possibly undesirable
22:05:00 <geist> what we did for zircon is since basically every syscall operates the same way: take a handle to a thing and do an operation on it. there's a phase in the syscall where the handle is looked up, and at that point the caller has access to it, even if the handle goes away simultaneously
22:06:00 <geist> and since handles cannot be modified, only added or removed, it avoids a bunch of races with handle changing permissions or whatnot
22:06:00 <mrvn> dh`: you can send thread2 a signal and see if read return EAGAIN
22:06:00 <geist> ie a slot in the handle table is in one of two states: present with a set of rights and pointing to an object, or empty. and can ony transition betweeo those two states
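That two-state slot, in miniature (a sketch; field and helper names are invented, and the real lookup would be atomic or locked): a slot either holds an object with a fixed set of rights or is empty, and lookup takes its own reference so a concurrent close can't pull the object out from under a syscall that is already past the table.

#include <stdint.h>

struct object;
void obj_incref(struct object *o);  /* assumed refcount helpers */
void obj_decref(struct object *o);

struct handle_slot {
    struct object *obj;  /* NULL means the slot is empty */
    uint32_t rights;     /* immutable while obj != NULL */
};

struct object *handle_lookup(struct handle_slot *s, uint32_t needed)
{
    struct object *o = s->obj;
    if (!o || (s->rights & needed) != needed)
        return NULL;
    obj_incref(o);  /* caller keeps the object alive past a close() */
    return o;
}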
22:10:00 <mrvn> dh`: thread2: read(fd), thread1: close(fd); closed = true; signal(USR1); If read returns EINVAL or something then close aborted the read. If read returns EAGAIN / closed is true then close() didn't abort for sure.
22:11:00 <mrvn> you can add a sufficiently large sleep() after close to make it even more observable.
22:13:00 <dh`> and since you can't post signals explicitly to threads, what if the signal is only ever delivered to thread 1? :-)
22:13:00 <dh`> (that's being difficult, yeah, it's a possible way to observe it)
22:14:00 <dh`> but the question isn't whether close aborted the read, that you can presumably tell by whether the read fails with EBADF
22:14:00 <dh`> it's whether you can observe that the read is still running after close completed
22:15:00 <mrvn> dh`: what would that look like? close() aborts the read but then read still returns data written to the file after close()?
22:16:00 <dh`> anyway, I'll just retreat to the next fortification, which is that file descriptors being handles is part of the semantics of the unix system call API and there's no reason to require operations on handles to affect other operations that have passed the look-up-handle stage
22:16:00 <mrvn> ack
22:16:00 <dh`> mrvn: no, the idea is that close doesn't abort the read
22:16:00 <geist> yah that's the model i think that makse sense
22:16:00 <mrvn> dh`: but that part the signal would show.
22:17:00 <dh`> basically read looks up its filetable entry, close removes that entry, close returns, read continues
22:18:00 <mrvn> There is a grey zone in my test where close would abort the read(), the read doesn't return yet though and the signal then still happens to catch the sleeping read.
22:19:00 <dh`> I guess another more direct way to observe this is: thread 1 calls read, thread 2 calls close then writes to a pipe, thread/process 3 reads from the pipe and writes to thread 1's file, thread 1 reads that data
22:19:00 <mrvn> would be an odd implementation though for the read to be aborted but still catch signals and change the return code.
22:19:00 <geist> that being said i wonder what happens to network sockets
22:19:00 <geist> though that may be exactly why shutdown() exists
22:19:00 <mrvn> geist: shutdown is so you can close the sending side while still reading.
22:19:00 <dh`> mrvn: in most implementations it would post the signal handler but return EBADF and not EINTR
22:20:00 <dh`> most signal implementations, that is
22:20:00 <geist> hmm, you're right. so then what happens if you close a socket that has a pending blocking operation on it? seems in that particular case like it'd be silly to keep it going
22:20:00 <geist> since it could, hypothetically, block forever
22:20:00 <mrvn> geist: how ever would the read wake up in that case? It's not getting any more data.
22:20:00 <dh`> the argument I'd make is that if that's what you want you should call revoke rather than close
22:21:00 <mrvn> dh`: if you really want to know 100% then you have to read the kernel source.
22:21:00 <geist> i'm gonna bet it aborts the read, and it comes down to a case where posix is unclear and its implementation defined what happens
22:21:00 <heat> i think linux just wakes everyone up on shutdown
22:21:00 <mrvn> otherwise the test shows 99.9% sure
22:21:00 <heat> like t1: read(sockfd, ...); t2: shutdown(sockfd, RD) t3: read() = 0
22:21:00 <heat> s/t3/t1/
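A small userspace demo of the behavior heat describes, on Linux (error handling elided; uses a unix socketpair rather than TCP, which is an assumption about equivalence): shutdown() from one thread wakes the read() blocked in another, which then returns 0.

#include <pthread.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

static int fd;

static void *reader(void *arg)
{
    char buf[64];
    ssize_t n = read(fd, buf, sizeof(buf));  /* blocks: no data yet */
    printf("read returned %zd\n", n);        /* 0 (EOF) after shutdown */
    return NULL;
}

int main(void)
{
    int sv[2];
    socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    fd = sv[0];

    pthread_t t;
    pthread_create(&t, NULL, reader, NULL);
    sleep(1);               /* let the reader block in read() */
    shutdown(fd, SHUT_RD);  /* wakes the blocked read */
    pthread_join(t, NULL);
    return 0;
}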
22:21:00 <mrvn> geist: I think a close on a socket or pipe means EOF so read should wake up.
22:22:00 <mrvn> Unlike on a file where close doesn't change that.
22:22:00 <dh`> my expectation would be that even for a socket the close would only close the handle, not the socket, and the socket would go away after the read releases it
22:22:00 <mrvn> dh`: then the read never wakes up and the socket remains behind forever.
22:22:00 <geist> yah reading the man pages for close it doesn't really say what happens on simultaneous operations, but it seems to imply that if it's the last handle then everything will be cleaned up
22:22:00 <geist> which implies that if something is blocking at least it'll get unblocked as the data structure it's on is getting removed
22:23:00 <dh`> the natural implementation is to incref the file object when you fetch it from the file table for read, so you own a reference to it, and nothing under it goes away until you drop that reference
22:23:00 <dh`> the reason for whatever weird thing happened in netbsd was that someone wanted to avoid the atomic incref for that
22:23:00 <geist> but if it's something non blocking, like a read that is just taking time to copy data, it probably as an implementation detail ends up waiting until that is finished
22:23:00 <mrvn> it's a valid but probably not that useful implementation
22:23:00 <dh`> mrvn: if you close the other end of the socket that will cause read to finish and return 0
22:24:00 <mrvn> dh`: I expect close() to close the socket. You are breaking that promise.
22:24:00 <geist> i think the idea there is there's a difference between bumping a ref and holding onto the object, and the object itself getting cancelled such that any blocking operations bounce out
22:24:00 <mrvn> dh`: so the other end never sees the socket close and won't close its own end.
22:24:00 <dh`> mrvn, that's not consistent with the existence of dup() let alone anything more complex
22:24:00 <geist> they are really two different things. the pending read can bounce out of something that is cancelled
22:24:00 <geist> if it's blocking
22:25:00 <mrvn> dh`: I assumed it's the close of the last copy of the socket.
22:25:00 <geist> if it's doing something non blocking, like page by page copying data out of a file cache, it could abort early or finish i suppose and still be consistent
22:26:00 <dh`> mrvn: but the thread reading has its own working copy of/reference to the socket
22:26:00 <dh`> if you wanted to revoke that you should have called revoke()
22:26:00 <mrvn> dh`: In my mind the process is this: close() -> socket close -> tcp close -> socket cancel blocked ops
22:26:00 <dh`> (and persuaded the maintainers of your kernel to support revoke on sockets)
22:26:00 <mrvn> dh`: the destruction of the tcp connection wakes up the read in the end.
22:27:00 <dh`> yes, but that's not the model you get by default
22:27:00 <mrvn> it's the behavior I expect posix systems to have. close on sockets should wake reads. Not sure what I expect on files.
22:28:00 <mrvn> The difference being EOF waking up read.
22:28:00 <dh`> it is definitely the case that that's not guaranteed, because like I said the natural implementation is that the read secures its own reference to the socket while it's working
22:29:00 <dh`> EOF will wake up read, but closing your file handle under the read doesn't cause that
22:29:00 <mrvn> dh`: so the socket would remain open forever? Even though the TCP side is closed?
22:29:00 <dh`> no
22:29:00 * dh` fails to understand what's so hard about this
22:29:00 <dh`> if you close the write end, the reader on the read end will wake up and exit
22:30:00 <mrvn> dh`: you close both ends with close()
22:30:00 <dh`> that's not a well-specified state
22:30:00 <dh`> close() closes file handles, not sockets.
22:31:00 <dh`> if you close the last reference to the write end, the reader on the read end will wake up and exit
22:31:00 <mrvn> then lets simplify: shutdown(fd, SHUT_RDWR);
22:31:00 <dh`> that should also cause any readers on the read end to wake up and exit
22:31:00 <mrvn> and close() should do something else on sockets?
22:32:00 <mrvn> In my mind close(fd) and shutdown(fd, SHUT_RDWR); should be the same.
22:32:00 <dh`> close closes file handles, not internal kernel objects
22:32:00 <dh`> they are not, because shutdown does not close the file handle
22:32:00 <geist> i'm not sure the man pages agree with that
22:33:00 <dh`> so your mind needs to visit a few man pages :-p
22:33:00 <geist> at least on linux and mac
22:33:00 <geist> both of them have verbiage to the effect of 'if it's the last file descriptor internal resources are freed'
22:34:00 <geist> lots of ways to read that but it seems to indicate that the fd count going to zero at least triggers some sort of internal shutdown path
22:34:00 <geist> even if there are still references to the objects floating around in the kernel
22:34:00 <mrvn> If fd is the last file descriptor referring to the underlying open file description (see open(2)), the resources associated with the open file description are freed;
22:34:00 <mrvn> *last file descriptor*, not internal reference
22:34:00 <dh`> maybe, it's not clear that whoever wrote that text even thought about pending references
22:34:00 <geist> not sure if these man pages are describing the behavior of the implementation of how its specced, however
22:35:00 <geist> s/of/or
22:35:00 <mrvn> possible
22:35:00 <dh`> and it's definitely inadvisable to impute intent regarding something to documentation that never considered it
22:35:00 <geist> the mac one is a bit more interesting
22:35:00 <mrvn> The manpage also says: "On Linux (and possibly some other systems), the behavior is different. the blocking I/O system call holds a reference to the underlying open file description, and this reference keeps the description open until the I/O system call completes. (See open(2) for a discussion of open file descriptions.) Thus, the blocking system call in the first thread
22:36:00 <mrvn> may successfully complete after the close() in the second thread."
22:36:00 <geist> "The close() call deletes a descriptor from the per-process object reference table. If this is the last reference to the underlying object, the object will be deactivated. For example, on the last close of a file the current seek pointer associated with the file is lost; on the last close of a socket(2) associated naming information and queued data are discarded; on the last close of a file holding an advisory lock the lock is
22:36:00 <geist> released (see further flock(2))."
22:36:00 <geist> the mac one seems to indicate it does the other path. ie, the object is closed when the last ref goes away, and internal refs also work
22:36:00 <dh`> geist: that's the same text we have in netbsd
22:37:00 <geist> yah probably derived from the same BSD docs
22:37:00 <dh`> yeah
22:37:00 <geist> actually says BSD 4 at the bottom yeah
22:37:00 <mrvn> I still think keeping a read() on a socket (and the socket and tcp connection) you closed alive is not desirable.
22:37:00 <mrvn> now I think the only thing left to do is test how bsd and linux actually behave.
22:37:00 <geist> so all this aside i think what we can derive is different posix systems don't handle this consistently
22:38:00 <geist> but since linux is the only thing that matters...
22:38:00 <heat> AMEN
22:38:00 <mrvn> hehe. zircon matters too
22:38:00 <heat> also HP-UX
22:38:00 <geist> i say the last thing with a heavy heart
22:38:00 <heat> only itanium supporting systems
22:39:00 <dh`> in netbsd the text dates back to -r1.1 in 1993 so probably from 4.4-lite
22:39:00 <sham1> closing a file descriptor should cause an EINTR or something like that
22:39:00 <geist> zircon actually has something more subtle: an object can have any number of internal references, including just plain user facing handles. but there *is* a one way signal called on the object when the user handle count goes to zero
22:40:00 <geist> .OnZeroHandles() or something like that on the object
22:40:00 <sham1> Basically to just stop the blocking read and saying "sorry, the file is now closed. Shouldn't have used threads like this"
22:40:00 <geist> so there are some cases where the last user handle going away automatically triggers some sort of internal cleanup even if some internal references are held
22:40:00 <dh`> as we just spent a long time discussing, that is not guaranteed and not how it's implemented in most places
22:40:00 <mrvn> geist: can you close a socket while it still has references?
22:40:00 <mrvn> (the tcp side)
22:40:00 <geist> in what case?
22:40:00 <mrvn> close()
22:40:00 <geist> i dunno,what OS are you talking about?
22:41:00 <dh`> whether this behavior violates the basic atomicity guarantees is at least debatable
22:41:00 <mrvn> zircon, one thread does read(fd), another does close(fd).
22:41:00 <geist> the kernel doesn't implement file systems or net stack
22:42:00 <geist> but the gist is the other side would see that the last handle to the IPC channel went away and start shutting down
22:42:00 <heat> this is why microkernels are superior
22:42:00 <geist> ie, the network stack gets a signal when the other end is closed (ie, on zero handles to the client side of the IPC channel that the socket is implemented over)
22:42:00 <heat> you avoid all kinds of debates by just not doing it
22:42:00 <geist> really IPC objects are the main users of the OnZeroHandles state, since otherwise you can't tell if the other side hung up
22:43:00 <dh`> in a microkernel environment, what does it even mean to have an operation pending while you close the handle?
22:43:00 <dh`> there are only messages
22:43:00 <geist> and since you cant construct a new handle from zero handles, it's a one way road: once you get to zero handles it's a permanent state
22:43:00 <mrvn> geist: yeah. but tcp sockets are a bit different. they have connections that you can shutdown without the object getting deleted.
22:43:00 <mrvn> (at least in posix)
22:43:00 <geist> in that case a shutdown() would almost certainly be a message over the IPC
22:43:00 <geist> dh`: depends on the type of microkernel. zircon is fairly uh... 'rich'
22:43:00 <mrvn> yep.
22:44:00 <geist> in that it is on the larger side of it, but what we do kinda consistently is there are N types of objects and they all operate using the same model
22:44:00 <geist> threads, processes, jobs, ipcs, memory objects, etc
22:44:00 <dh`> or does the system guarantee you a reply paired with your request or something so there is still some kind of pending state?
22:44:00 <geist> so yes you can 'read' from a VM object for example, which is kinda file like
22:45:00 <mrvn> geist: so my mind model would be that close(fd) does send a shutdown IPC message for sockets and then later when the refcount becomes 0 the resources get freed.
22:45:00 <mrvn> and removes the handle from the table
22:45:00 <heat> how do I make concurrent open()'s unsuck?
22:45:00 <heat> or suck less
22:45:00 <geist> yah though in the case where the process simply gets axed and all the handles closed you always have to have a mechanism for servers in a µkernel world to detect the closing of the other end
22:45:00 <geist> so the built in OnZeroHandles mechanism works for zircon for that
22:45:00 <dh`> define suck in this case?
22:46:00 <geist> ie, in lieu of any explicit shutdown at least the server notices the other side went away
22:46:00 <heat> imagine a fd table with an rwlock, open/socket/dup/whatever that creates a fd needs to write lock
22:46:00 <heat> which Sucks(tm)
22:46:00 <heat> I think most UNIXes have a workaround for this (full blown RCU or other more dollar store solutions)
22:46:00 <dh`> open the object first, only lock the table to scan it and insert
22:47:00 <geist> mrvn: but yeah for a socket style close() (if you're trying to implement POSIX on top of the µkernel) you could do some sort of pending message to it
22:47:00 <heat> yes, but that's still slow
22:47:00 <dh`> (or alternatively, lock the table first to scan it and insert a placeholder, then leave only the placeholder locked while you're working)
22:47:00 <heat> you'll still have a bunch of contention there which will be PESSIMAL
22:47:00 <heat> I remember NetBSD also had a workaround for this
22:47:00 <dh`> unfortunately for unix-style handles you're supposed to guarantee that you return the lowest available slot so you can't avoid the scanning
22:47:00 <geist> mrvn: a lot of it depends on how much you do or dont try to map posix fds to a IPC object. if you did 1:1 it might make sense to just use the IPC channel close semantics to do the same thing as close
22:48:00 <dh`> you can cache it away in some cases
22:48:00 <geist> but if it's something more complicated, where you're multiplexing N fds over M IPC channels then you could build up your own state there in user space
22:48:00 <mrvn> geist: it's more about sockets having a shutdown method separate from the socket object getting destroyed.
22:48:00 <geist> sure
22:48:00 <geist> shutdown() i'd assume would be a message over the IPC channel to the network server
22:49:00 <mrvn> yep. as it is with tcp
22:49:00 <geist> since you're already going to need some messaging scheme for all the other out of band data
22:49:00 <mrvn> files don't have that semantic so I have no idea what read() on a file should do. different expectation there.
22:50:00 <mrvn> but with the IPC mechanism having files and sockets behave the same, i.e. close(fd) sends a shutdown over the IPC connection, it would make sense to have them behave the same.
22:50:00 <dh`> heat: you could imagine something like always keeping the descriptor table dense by allocating placeholder objects for holes and then keeping the placeholder objects on a linked list
22:50:00 <geist> yah part of the sort of half solution of modelling sockets as files in unix
22:50:00 <geist> like, it sort of works except where it doesn't
22:51:00 <dh`> so when you go to allocate you pop the first placeholder off the freelist, and if there isn't any you grab the next table entry
22:51:00 <netbsduser> i do like to allocate placeholder objects
22:51:00 <mrvn> dh`: union { int next_free_fd; file_descr fd; } fds[max];
22:52:00 <dh`> whether this is actually any better than just locking the table and scanning it (especially if you keep track of the start and end points for scanning) is an open question
22:52:00 <dh`> I'd guess not
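mrvn's union freelist, spelled out as a sketch (types and names invented; init links every slot): closed slots hold the freelist link in place of the file pointer. Note dh`'s earlier caveat applies: a LIFO freelist hands back the most recently closed slot, not the lowest-numbered one POSIX requires, unless the list is kept sorted.

#define MAXFD 1024

struct file;  /* opaque open-file object */

static union {
    int next_free;    /* valid while the slot is closed */
    struct file *fp;  /* valid while the slot is open */
} fds[MAXFD];

static int freelist;  /* head of the closed-slot list, -1 when full */

static void fd_init(void)
{
    for (int i = 0; i < MAXFD - 1; i++)
        fds[i].next_free = i + 1;
    fds[MAXFD - 1].next_free = -1;
    freelist = 0;
}

static int fd_alloc(struct file *fp)
{
    if (freelist < 0)
        return -1;  /* table full */
    int fd = freelist;
    freelist = fds[fd].next_free;
    fds[fd].fp = fp;
    return fd;
}

static void fd_close(int fd)
{
    fds[fd].next_free = freelist;  /* LIFO push; see the caveat above */
    freelist = fd;
}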
22:52:00 <mjg> burp
22:52:00 <mjg> lemme tell you something
22:52:00 <heat> mjg, omg hi rick
22:52:00 <mrvn> dh`: both are O(1) if the table has a maximum size.
22:52:00 <netbsduser> it's how i do page-in efficiently: you allocate the page and mark it busy, and abandon all locks, then you wait on an event (to which the page structure points) until the page is in-memory
22:52:00 <dh`> mrvn: that doesn't work well because you need to be able to insert
22:52:00 <heat> TELL ME HOW TO DO LOCKLESS
22:52:00 <mjg> heat: EZ
22:52:00 <dh`> EVERYTHING is O(1) if the size is bounded, that's not useful
22:52:00 <mjg> - spin_lock(&lock);
22:52:00 <mrvn> dh`: insert what?
22:52:00 <mjg> + //spin_lock(&lock);
22:52:00 <netbsduser> so if someone decides to munmap the area, then it just sets a flag in the page saying "you are surplus to requirements, please be freed when this I/O is done"
22:52:00 <mjg> now yo uare LOCKLESS
22:53:00 <dh`> mrvn: freelist entries
22:53:00 <mrvn> dh`: how do you insert an FD between 4 and 5?
22:53:00 <heat> mjg, I'm reading netbsd's fd_expand, etc and I don't get it
22:53:00 <dh`> mrvn: suppose fds 0-2, 4-7, and 8-10 are open and I close 5
22:53:00 <mrvn> dh`: ahh, why? you reuse the last closed FD first.
22:53:00 <heat> it looks like running with scissors atomics version
22:53:00 <mjg> heat: just like with openbsd, i'm not looking at that
22:53:00 <mrvn> nothing says open should get the lowest free FD, right?
22:53:00 <dh`> not in unix you don't, you are required to return the lowest-numbered available fd
22:53:00 <mjg> mrvn: posix says
22:54:00 <mjg> which is a major pita
22:54:00 <mrvn> you want to do POSIX? you have bigger problems. :)
22:54:00 <dh`> posix says, because traditionally there was no dup2 and if you wanted to do I/O redirection you had to rely on that semantic
22:54:00 <heat> mjg, freebsd uses SMR right?
22:54:00 <mjg> whether you want or not what to do posix, this has been the case for decades
22:54:00 <mjg> heat: for what
22:54:00 <mjg> so you can't just change it
22:54:00 <heat> this stuff
22:54:00 <mjg> heat: no
22:54:00 <dh`> realistically these days you're unlikely to break anything by violating that rule
22:55:00 <mjg> there is code which expects the order
22:55:00 <heat> mjg, ok father, then how does it do stuff
22:55:00 <heat> does it handroll some weird RCU too
22:55:00 <mjg> heat: it is all stupid
22:55:00 <mjg> heat: file * objs are *never* actually freed
22:55:00 <dh`> mjg: have you seen any such code in the wild in the last say 10 years? I haven't
22:55:00 <mjg> heat: and file tables only disappear after the proc exits
22:55:00 <heat> god what
22:55:00 <mjg> dh`: i did, the idea is: close all shit, then open /dev/null a bunch of times to fill in 0/1/2
22:55:00 <mjg> heat: GEEZER motherfucker
22:55:00 <dh`> yes, I know the idea
22:56:00 <dh`> where did you see it and why didn't you patch it out?
22:56:00 <mjg> well if you no longer guarantee lowest fd, you are dead here
22:56:00 <mjg> i don't even remember, does it matter?
22:56:00 <heat> so erm erm erm if I expand the fd table a bunch of times do you keep them all cached?
22:56:00 <mjg> point is there is code like that out there
22:56:00 <mjg> you can't just change the behavior from under it
22:56:00 <mjg> heat: not if the process is single threaded, otherwise yes
22:56:00 <mrvn> anyway, you can make it a sorted doubly linked list.
22:56:00 <dh`> it matters because if you decide to break that rule you want to know what the probability is of hitting something that doesn't work
22:57:00 <mjg> heat: it does not have to be like that, rcu or not, but here we are. mostly because geezer
22:57:00 <mrvn> or just keep a pointer to the lowest free FD and search from there.
22:57:00 <heat> god.jpeg
22:57:00 <heat> netbsd seems to do similar
22:57:00 <mjg> yes, the idea was taken from netbsd
22:57:00 <mjg> it is all geezer
22:57:00 <heat> now I want to see what OpenBSD does
22:58:00 <mjg> :DD
22:58:00 <mjg> brah
22:58:00 <heat> i bet 10 on BKL
22:58:00 <mjg> openbsd is doing turbo stupid
22:58:00 <mjg> here is a funny story
22:58:00 <heat> hey no spoilers!
22:58:00 <mjg> traditionally unix would allocate fds by traversing an array
22:58:00 <mjg> bsds including
22:58:00 <mjg> openbsd was the first bsd to implement a bitmap, in fact two level
22:58:00 <mjg> some time later the rest followed suit
22:59:00 <mjg> except freebsd has one level with no explanation why not two
22:59:00 <mjg> all of which referred to the same paper
22:59:00 <mjg> so sounds like obsd has a leg up, or at least did...
22:59:00 <geist> okay, again.
22:59:00 <heat> what paper?
22:59:00 <mjg> except apart from the bitmaps they *still* do linear array scans
22:59:00 <mjg> give me a minute
23:00:00 <zzo38> I think there is something wrong with the wiki. Using real mode does not necessarily mean that you have to use the PC BIOS, and it is not necessary to use the PC BIOS for all operations even if you have it available.
23:00:00 <mjg> heat: Scalable kernel performance for Internet servers under realistic loads
23:00:00 <mjg> heat: gaurav banga & co
23:01:00 <geist> zzo38: this is true. is there a good example of this?
23:01:00 <zzo38> It is true that some of the hardware features are a bit messy due to compatibility (such as the A20 gate), but some of the things still can be sensible for some kinds of systems.
23:01:00 <geist> i can imagine there's stuff that goes out of its way to use bios calls to write to the text display
23:01:00 <zzo38> Also, UEFI is even more messy and even worse, in many ways.
23:02:00 <netbsduser> this is why i keep well away from both
23:02:00 <zzo38> BIOS calls are probably most useful during the initial booting to read the kernel and drivers from the disk; after that, presumably you will have better drivers suitable for your system.
23:02:00 <geist> that's the idea, yeah
23:03:00 <mjg> heat: btw the paper incorrectly claims the approach is logarithmic
23:03:00 <mjg> heat: kind of funny
23:03:00 <heat> mjg, open seems to have copied net too
23:03:00 <mjg> heat: too bad they did not bench vs single-level bitmap
23:03:00 <mjg> in what regard
23:03:00 <mjg> *bitmaps* were first in openbsd afair, it was the rest which copied from there
23:03:00 <mjg> obsd got it in 2003 or so
23:04:00 <mjg> very positively surprised with dtrace: dtrace -n 'fbt::memset:entry { printf("%d %d", cpu, arg2); }' -o /tmp/memsets
23:04:00 <mjg> per-cpu trace of all calls with 0 drops
23:04:00 <mjg> very nice
23:04:00 <heat> also fyi linux also does single-level I think
23:04:00 <mjg> no, linux got 2 level ~7 years ago
23:04:00 <zzo38> Also, the PC BIOS provides the booting function, and UEFI is too complicated in that way. Furthermore, I think it is not legitimate to be called "PC" if the PC BIOS is not implemented. (I do have an idea about how to design a better booting system in ROM, but it is not a PC booting system but it would be possible to implement both if it is desirable)
23:05:00 <mjg> heat: the real interesting bit is solaris which has a *tree* instead
23:05:00 <mjg> heat: dfly copied from there
23:05:00 <mjg> i have no idea how that performs
23:06:00 <dh`> two layers is still a tree
23:06:00 <zzo38> I think that HDMI and USB also is no good
23:06:00 <mrvn> a stunted tree
23:06:00 <mjg> guaranteed 2 layers no matter what is not a tree
23:06:00 <heat> how does a 2-layer bitmap work?
23:06:00 <mjg> cmon dawg
23:06:00 <mrvn> heat: top layer bit says if there is a leaf for the 2nd level bitmap
23:06:00 <mjg> read some openbsd!
23:06:00 <geist> i assume you just have a top layer bitmap that determines if there are holes in blocks of lower level nodes
23:06:00 <mjg> right
23:07:00 <mjg> that's it, literally 0 magic
23:07:00 <dh`> however you want, but my guess would be that each bit in the lower layer indicates the state of one fd entry and each bit in the upper layer indicates whether there's a free bit in each word of the lower layer
23:07:00 <dh`> because with 32-bit words and 1024 fds max that all fits tidily
23:07:00 <heat> right
23:07:00 <geist> yah that's what i'd think. you could do something more complicated like a bit that signifies if the entire sub tree is occupied or not
23:08:00 <netbsduser> just flicked Solaris Internals open to the page on the fd tree, funny, they have a comment on exactly what people were chatting about earlier on colliding read() and close()
23:08:00 <dh`> but it seems stupid given that the granularity of the upper layer should be a whole cache line of the lower layer
23:08:00 <mrvn> So 2 find_lowest_zero() calls give you the FD you can use.
23:08:00 <geist> it's all because of the stupid property that fds are first fit
23:09:00 <geist> and the source of a whole class of bugs and exploits
23:09:00 <dh`> and furthermore, each entry in the lower layer may as well represent a whole cache line's worth of the table itself
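(A minimal sketch of the two-level bitmap just described, assuming 64-bit words and a fixed 1024-entry table; a set top-level bit means the corresponding lower-level word is completely full, so two find-lowest-zero operations locate the lowest free fd. Names and sizes are illustrative, not taken from any of the kernels mentioned:)

    #include <stdint.h>

    #define NWORDS 16                       /* 16 * 64 = 1024 fds */
    static uint64_t lower[NWORDS];          /* bit set => fd in use */
    static uint64_t top;                    /* bit set => lower word is full */

    /* lowest zero bit of w, or -1 if w is all ones (gcc/clang builtin) */
    static int find_lowest_zero(uint64_t w)
    {
        return __builtin_ffsll(~w) - 1;
    }

    int fd_alloc(void)
    {
        int t = find_lowest_zero(top);
        if (t < 0 || t >= NWORDS)
            return -1;                      /* table full */
        int b = find_lowest_zero(lower[t]);
        lower[t] |= 1ULL << b;
        if (lower[t] == ~0ULL)
            top |= 1ULL << t;               /* word just became full */
        return t * 64 + b;
    }

    void fd_free(int fd)
    {
        lower[fd / 64] &= ~(1ULL << (fd % 64));
        top &= ~(1ULL << (fd / 64));        /* word has a hole again */
    }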
23:09:00 <heat> netbsduser, why do you have a Solaris Internals
23:09:00 <geist> i have that book too
23:09:00 <heat> do you have an Internals for every SVR4 descendent
23:09:00 <geist> it's quite well written
23:09:00 <netbsduser> heat: i like to read about other OSes to appreciate them + ruthlessly steal ideas i like
23:09:00 <mrvn> geist: AMEN, open() should return the next free FD with 64bit rollover.
23:09:00 <heat> how's the STREAMS
23:10:00 <geist> mrvn: we explicitly randomized handles in zircon to avoid this stuff
23:10:00 <dh`> <mrvn> yeah I want a fdtable of size 2^64
23:10:00 <mrvn> dh`: you hash that
23:10:00 <heat> geist, but randomized handles forces you to use a tree which sucks
23:10:00 <geist> not necessarily
23:10:00 <mrvn> geist: so nobody can guess a handle?
23:10:00 <heat> you can't do anything remotely flat can you?
23:11:00 <geist> mrvn: correct. and more importantly if you close a handle it wont get reused quickly
23:11:00 <geist> heat: depends on how good you 'randomize' it.
23:11:00 <netbsduser> it's coming along, i want to figure out whether i can implement a unified low-level module which pipes/fifos, ttys, etc can all use
23:11:00 <mrvn> heat: you can hash the handle down to a small int and choose your random handles so the hash doesn't collide.
23:11:00 <geist> basically we feed it through a hash and put some salt in it so that the same slot doesn't net the same id
23:11:00 <mrvn> heat: creating a handle might have to try a few times.
23:11:00 <geist> at the end of the day it is indeed slots in a table, but the process sees it hashed
23:12:00 <heat> oh ok, so the handles aren't indices?
23:12:00 <geist> it's not cryptographically perfect. you can guess, but the main point is to avoid reusing handles quickly
23:12:00 <geist> so that most use-after-free bugs are caught
23:12:00 <dh`> just allocating sequentially mod the table size serves that purpose well enough
23:12:00 <geist> they're 'random' to the process, though post hash + salt they are indeed just indices
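(A toy illustration of the idea, not Zircon's actual scheme: a per-slot generation counter stands in here for the hash-plus-salt geist describes, but the effect is the same, i.e. the process-visible value decodes back to a table index, and a recycled slot hands out a different value, so stale handles are caught:)

    #include <stdint.h>

    #define TABLE_BITS 10
    #define TABLE_SIZE (1u << TABLE_BITS)

    static uint32_t gen[TABLE_SIZE];         /* bumped on each reuse of a slot */

    /* process-visible handle: generation in the high bits, index below
       (a real table would also track whether the slot is live) */
    uint32_t handle_make(uint32_t index)
    {
        gen[index]++;                        /* new generation for this slot */
        return (gen[index] << TABLE_BITS) | index;
    }

    /* kernel side: recover the index, reject stale or forged handles */
    int handle_lookup(uint32_t handle, uint32_t *index_out)
    {
        uint32_t index = handle & (TABLE_SIZE - 1);
        if ((handle >> TABLE_BITS) != gen[index])
            return -1;                       /* slot was recycled since */
        *index_out = index;
        return 0;
    }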
23:13:00 <dh`> (like with process ids, random process ids still seem like a stupid idea)
23:13:00 <mjg> geist: you *randomize* fd for security? am i misreading something?
23:14:00 <geist> basically. though zircon doesnt have fds per se. but it's the handle table
23:14:00 <geist> less of security and more of a bug catching thing
23:14:00 <geist> ie, handles take a very long time to get recycled
23:14:00 <mrvn> mjg: I think more a mitigation against bad code
23:15:00 <geist> we have some additional constraints you can put on a process to cause them to instantly abort if a bad handle is used
23:15:00 <geist> that catches a ton of things
23:15:00 <mjg> do you have dup2-equivalent?
23:15:00 <geist> no
23:15:00 <moon-child> imagine fd is stored in memory and buffer overflow corrupts it
23:15:00 <mjg> geist: ye that is a real problem
23:15:00 <moon-child> you're better off if malicious actor can't control which fd it turns into
23:15:00 <geist> you absolutely cannot, under any circumstances, create a handle at a known value
23:15:00 <mjg> there are known multithreaded progs which use fds as they are being closed
23:15:00 <mjg> unintentionally
23:15:00 <moon-child> I heard the following anecdote: somebody forked, closed stderr, and then mapped some memory
23:15:00 <moon-child> then wrote a log message
23:15:00 <mjg> there was a bug in freebsd once which broke them
23:15:00 <mjg> kind of funny
23:15:00 <moon-child> mapped memory reused the stderr fd
23:16:00 <moon-child> so log message stomped mmap
23:16:00 <heat_> are we looping
23:16:00 <mjg> :d
23:16:00 <geist> we explicitly designed the handle mechanism to try to deal with this whole known set of posix issues with fd recycling and whatnot
23:16:00 <geist> works pretty well
23:16:00 <netbsduser> moon-child: that's appalling
23:16:00 <netbsduser> where did that happen
23:16:00 <zzo38> I would have solved it by giving file descriptors that are not explicitly assigned a number a minimum file descriptor number; if you want a lower number then you must explicitly request it.
23:16:00 <moon-child> arcan
23:17:00 <mrvn> zzo38: that's even worse. Now all libraries compete for low numbers.
23:17:00 <geist> iirc QNX did something like putting all posix fds in positive space, and all other handles to QNX specific stuff in negative space (bit 31)
23:17:00 <dh`> you can't have both well-known addresses and a scheme for avoiding well-known addresses
23:17:00 <geist> or something along those lines, so the kernel can use different allocation strategies
23:17:00 <dh`> I can't imagine that would work since < 0 being invalid is baked in everywhere
23:17:00 <mrvn> dh`: you can pass the "well-known" addresses as arguments to a process.
23:18:00 <geist> idea is that for internal qnx stuff that's not doing posix, the negative handles are *bad*
23:18:00 <geist> so if they do leak out to posix space they wont work
23:18:00 <dh`> they'd have to audit pretty much every open for only testing -1 explicitly instead of < 0
23:18:00 <geist> qnx being a microkernel, it's implementing posix in user space
23:18:00 <netbsduser> geist: clever trick, i might have to imitate that
23:18:00 <mjg> geist: my seal of approval
23:18:00 <dh`> I guess
23:18:00 <zzo38> mrvn: Well, normally 0 is used for stdin, 1 for stdout, 2 for stderr. Libraries shouldn't need to compete for low numbers, since they are only used for standard I/O anyways, I think
23:18:00 <dh`> mrvn: you can but there are various costs to that
23:19:00 <heat> mjg's seal of approval is RARE
23:19:00 <geist> with some caveats being that they have some affordance for the kernel to directly map some ipc channels to fds, and in those case the fd-to-handle mapping is 1:1
23:19:00 <mrvn> zzo38: but we don't have 0, 1, 2 anyway so that point is moot.
23:19:00 <geist> and for everything else, handles to things that are meaningless to posix, they're in a different namespace, basically
23:19:00 <mjg> heat: true mjg!
23:19:00 <mjg> heat: true mjg@
23:19:00 <heat> ok mjg@
23:19:00 <geist> negative, not negative, doesn't matter. idea is the namespacing really
23:19:00 <netbsduser> i do know of some software which uses any means necessary to find out all open fds in a process and close them, but i suppose you can simply hide them from any posixy ways to find that out
23:19:00 <zzo38> However, my own (currently unnamed) operating system design does not have file descriptor numbers (although it can be emulated, if required for POSIX capabilities).
23:19:00 <netbsduser> (namely systemd uses linux's procfs to find them out)
23:20:00 <heat> netbsduser, FreeBSD has a syscall for that
23:20:00 <heat> and so does linux now
23:20:00 <mrvn> netbsduser: lots of software does that. Modern software should use CLOEXEC and the posix call to iterate over the fds.
23:20:00 <netbsduser> heat: is there a specific syscall or is it via sysctl?
23:20:00 <geist> i suppose it'd be easily enough to implement some sort of close_range() call
23:20:00 <heat> syscall, close_range in both
23:21:00 <mrvn> netbsduser: scanning procfs fails with threads.
23:21:00 <heat> and closefrom in the libc I think
23:21:00 <geist> you can only make a best effort. even close_range() would intrinsically race with opens in another thread
23:21:00 <geist> but most likely you'd define it to make one pass, closing fds in a particular order
23:21:00 <netbsduser> systemd wants to close everything not on a whitelist it creates of acceptable fds, so i am not sure whether a close_range would work for it
23:21:00 <geist> such that races with any other threads are at least somewhat predictable
23:21:00 <zzo38> I do not have a name for my operating system design, so far
23:22:00 <zzo38> What operating systems were you designing and do you have any link of documentation?
23:22:00 <heat> netbsduser, sure does, use the gaps
23:22:00 <dh`> I don't think anyone here much cares what silly things systemd does
23:22:00 <mrvn> geist: posix_spawn can make it atomic
23:22:00 <mrvn> or you close after fork()
23:22:00 <netbsduser> dh`: i need to for the sake of a pointless publicity stunt
23:22:00 <heat> doing close_range for all the gaps is probably still a good bit faster than looping through fds and closing
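(A sketch of that gaps approach, assuming the close_range() discussed above as available on Linux 5.9+ and recent FreeBSD via <unistd.h>; close_all_but() is a hypothetical helper, not a library function. It takes a sorted whitelist and closes every range of fds between the kept ones:)

    #define _GNU_SOURCE                      /* for close_range() on glibc */
    #include <unistd.h>

    /* keep[] is sorted ascending; everything else gets closed */
    void close_all_but(const int *keep, int nkeep)
    {
        unsigned int next = 0;
        for (int i = 0; i < nkeep; i++) {
            if ((unsigned int)keep[i] > next)
                close_range(next, keep[i] - 1, 0);
            next = keep[i] + 1;
        }
        close_range(next, ~0u, 0);           /* all fds above the last kept one */
    }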
23:23:00 <geist> right, where the multithreading isn't an issue
23:23:00 <netbsduser> porting systemd to my kernel would make excellent hackernews bait
23:23:00 <geist> because post fork it's just a single thread
23:23:00 <heat> lol
23:23:00 <geist> (until you start making new ones)
23:23:00 <mrvn> and you really shouldn't close random FDs before the fork()
23:23:00 <heat> didn't you port systemd to BSD?
23:23:00 <heat> or do I have the wrong guy
23:23:00 <netbsduser> i did, it was mostly for the same reason
23:24:00 <heat> ah, you do like the headlines
23:24:00 <geist> hah elaborate ways to get social headlines huh
23:24:00 <geist> i suppose that checks out
23:24:00 <netbsduser> i have an insatiable inner troll but i couldn't bear to do it the old-fashioned way with incendiary posts to forums and suchlike
23:24:00 <heat> you should port glibc to BSD now
23:24:00 <mrvn> Which actually brings me to a problem I had at work on friday: How do you get the highest FD.fileno that's open under python?
23:25:00 <heat> and then coreutils
23:25:00 <netbsduser> doing weird things with software is much more professional
23:25:00 <mrvn> heat: Debian kfreebsd
23:25:00 <netbsduser> heat: glibc did have a freebsd port at one point
23:25:00 <heat> i know
23:25:00 <netbsduser> at least formally it's a retargetable libc, i know someone is trying to bring it to managarm
23:25:00 <heat> so did debian
23:26:00 <heat> i have an in-progress port to Onyx
23:26:00 <netbsduser> they are having big trouble with its native posix threading library, which is very linux
23:26:00 <heat> it's a good libc
23:26:00 <heat> bah, nptl is fine
23:26:00 <heat> i hacked musl's nptl stuff and glibc isn't that much harder
23:27:00 <heat> you can also just implement your own separate library because glibc is ofc completely configurable
23:27:00 <netbsduser> in my experience i've found gnu stuff is often surprisingly portable, who else (but perl) checks for dynix, eunice, and the windows subsystem for posix applications in their configure scripts?
23:28:00 <geist> glibc, yeah that's been ported to all sorts of non linux things
23:28:00 <mjg> sorry to interject, do you have a minute to flame about memset?
23:28:00 <geist> haiku uses it, and back in the day BeOS did
23:28:00 <heat> yes gnu stuff is Great(tm)
23:28:00 <mjg> i got a real trace, all memsets made during build kernel, for each cpu
23:28:00 * dh` chuckles politely
23:28:00 <heat> supports all kinds of crap systems
23:28:00 <mjg> and a prog to execute them
23:28:00 <netbsduser> i never knew they were using glibc at haiku
23:29:00 <mjg> heheszek-read-current 148708742 cycles
23:29:00 <mjg> heheszek-read-bionic 98762683 cycles
23:29:00 <mjg> heheszek-read-erms 233876267 cycles
23:29:00 <mrvn> mjg: do you have a histogram of sizes?
23:29:00 <mjg> bionic "cheats" by using simd
23:29:00 <netbsduser> i know vmware esxi uses it (but i'm not sure if they just implemented a rudimentary linux abi compatibility layer)
23:29:00 <mrvn> mjg: how often is memset called to bzero?
23:29:00 <mjg> literally every time
23:29:00 <mjg> anyhow, as you can see, erms is turbo slower
23:29:00 <heat> mjg, ok, what's the point
23:29:00 <mrvn> literally or every time? Is there even one exception?
23:30:00 <mjg> mrvn: not in production
23:30:00 <mjg> debug has it for poisoning
23:30:00 <mjg> heat: what's the point of what
23:30:00 <mrvn> mjg: with a byte value or 32/64 bit pattern?
23:30:00 <heat> what's the big revelation in these results?
23:31:00 <heat> rep stosb bad, simd good, current ok?
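(For reference, the "erms" variant being benchmarked here is essentially nothing but rep stosb; a freestanding sketch for x86-64 with GNU-style inline asm, not any particular kernel's implementation:)

    #include <stddef.h>

    /* the ERMS strategy: on CPUs advertising Enhanced REP MOVSB/STOSB,
       a bare rep stosb is the entire implementation */
    void *memset_erms(void *dst, int c, size_t n)
    {
        void *ret = dst;
        __asm__ volatile("rep stosb"
                         : "+D"(dst), "+c"(n)   /* rdi = dest, rcx = count */
                         : "a"(c)               /* al = fill byte */
                         : "memory");
        return ret;
    }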
23:31:00 <mjg> heat: there is no revelation, just confirmation erms crapper
23:31:00 <mjg> heat: and more importantly now there is a realworld-er (if you will) setup to bench changes to memset
23:31:00 <heat> where
23:31:00 <mjg> on my test box!
23:31:00 <heat> is this Proprietary(tm)
23:32:00 <mjg> not-heat licensed
23:32:00 <mjg> look mate, the code looks like poo right now
23:32:00 <mjg> i'm gonna play around with memset, clean that up and then publish somewhere
23:32:00 <heat> cool
23:33:00 <mjg> will be useful for that linux flame thread
23:33:00 <heat> no one flamed man
23:33:00 <mjg> note there was one major worry here: that there are branch prediction misses
23:33:00 <heat> how is that a flame thread?
23:33:00 <mjg> with ever changing sizes
23:33:00 <mjg> heat: see my previous remark about polack word choice
23:34:00 <heat> that thread is probably the tamest the lkml has ever been
23:34:00 <heat> particularly since linus likes you so much
23:34:00 <mjg> i don't think he does mate, but senkju
23:34:00 <heat> you're way better than the other mjg
23:34:00 <mjg> i'm going to generate more traces, including from linux
23:34:00 <mjg> for memset, memcpy and copyin/copyout
23:35:00 <mjg> then we will see what happens
23:35:00 <heat> geist, hello sir do u have time to run something on one of your ryzens?
23:36:00 <mjg> heat: do you have a memset?
23:36:00 <heat> no
23:36:00 <mjg> aight, no biggie
23:37:00 <heat> i wanted to try borislav's "rep movsb is totally good on amd" claim
23:37:00 <mjg> which amd tho
23:37:00 <heat> recent probably
23:37:00 <mjg> right
23:37:00 <heat> everything was bad on bulldozer
23:37:00 <mjg> it may not even be on the market
23:38:00 <mjg> even so, i have to note the typical way of benchmarking string ops by calling them in a loop with same size stuff can be quite misleading
23:38:00 <mrvn> heat: you should make the kernel/libc benchmark memcpy/memset at start and pick the fastest for the actual cpu.
23:38:00 <mjg> for example, if you trained the branch predictor, a 32-byte loop is way faster than erms for sizes past 256 bytes even
23:38:00 <mjg> but this goes down the drain if you get misses
23:38:00 <mjg> tradeoff city
23:39:00 <mjg> in fact you may get slower
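(A sketch of the pitfall mjg is pointing at: calling memset in a loop with one fixed size trains the branch predictor, while shuffled sizes defeat it, so the two runs below can rank implementations differently. Everything here is illustrative; the sizes are chosen so both runs move the same number of bytes on average:)

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <x86intrin.h>                   /* __rdtsc(), gcc/clang on x86 */

    #define CALLS 1000000
    static char buf[4096];
    static size_t fixed[CALLS], shuffled[CALLS];

    static uint64_t bench(const size_t *sizes)
    {
        uint64_t start = __rdtsc();
        for (int i = 0; i < CALLS; i++)
            memset(buf, 0, sizes[i]);
        return __rdtsc() - start;
    }

    int main(void)
    {
        for (int i = 0; i < CALLS; i++) {
            fixed[i] = 256;                  /* predictor-friendly */
            shuffled[i] = 1 + rand() % 511;  /* ever-changing, mean ~256 */
        }
        printf("fixed:    %llu cycles\n", (unsigned long long)bench(fixed));
        printf("shuffled: %llu cycles\n", (unsigned long long)bench(shuffled));
        return 0;
    }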
23:39:00 <heat> yes but don't forget this is all microbenchmarking
23:40:00 <mjg> north remembers
23:40:00 <heat> does any of this REALLY matter on a real workload? probably not
23:40:00 <mjg> ha
23:40:00 <mjg> wrong!
23:40:00 <heat> maybe slightly
23:40:00 <mjg> lemme find it
23:40:00 <mjg> well it mostly does not once you reach basic sanity
23:40:00 <heat> it's like the age old "just use rep movsb/q/l/w, cuz icache"
23:41:00 <mjg> i got numbers showing *tar* speed up after unfucking the string ops
23:41:00 <mjg> they used to be rep stosq ; rep stosb
23:41:00 <mjg> and so on
23:41:00 <mjg> absolute fucking massacre for the cpu
23:42:00 <mjg> bummer, can't find it right now
23:42:00 <heat> yeah but tar is just a fancy exercise in memory copying isn't it
23:42:00 <mjg> but bottom line, the really bad ops were demolishing perf
23:42:00 <heat> read(...) + write(...)
23:42:00 <mjg> handsome, tar was doing a metric fuckton of few byte ops
23:42:00 <mjg> not the actual data extraction
23:42:00 <mjg> and this was most affected
23:42:00 <heat> did you just call me handsome
23:43:00 <mjg> it is my goto insult
23:44:00 <heat> it's the harshest canadian insult after all
23:44:00 <mjg> so the jury is still out
23:44:00 <mjg> i *randomized* tons of sizes and fed them
23:44:00 <mjg> into the bench
23:45:00 <mjg> this makes erms faster *sometimes* and it is all because of branch mispredicts
23:45:00 <mjg> 19% for current memset, 4% for erms
23:45:00 <mjg> i note real-world trace has a win because the calls tend to repeat
23:46:00 <mjg> but in principle there may be another workload where the above happens instead
23:48:00 <zzo38> Are there wiki pages relating to such things as capability-based security?
23:49:00 <mjg> not that i know of, but one thing to google is: capsicum
23:50:00 <heat> and fuchsia