channel logs for 2004 - 2010 are archived at http://tunes.org/~nef/logs/old/ ·· can't be searched
#osdev2 = #osdev @ Libera from 23may2021 to present
#osdev @ OPN/FreeNode from 3apr2001 to 23may2021
all other channels are on OPN/FreeNode from 2004 to present
http://bespin.org/~qz/search/?view=1&c=osdev2&y=23&m=3&d=6
00:06:00 <mjg> and here is another trace:
00:06:00 <mjg> heheszek-read-bionic 62619868 cycles
00:06:00 <mjg> heheszek-read-erms 192876429 cycles
00:06:00 <mjg> heheszek-read-current 108545984 cycles
00:06:00 <mjg> heheszek-read-broken 33159219 cycles
00:07:00 <mjg> the -broken variant is a func which does nothing
00:07:00 <mjg> you may notice a func which does nothing is using 1/3rd of the time memset-current is
00:07:00 <mjg> also -erms almost twice as slow
00:10:00 <mjg> need to name it better
00:11:00 <mrvn> mjg: and that still works?
00:11:00 <heat> stosb
00:11:00 <mjg> mrvn: ? the broken thing was only added for comparison purposes
00:12:00 <mjg> to see how much is spent on mere fact there is a func call
00:12:00 <mrvn> mjg: but I thought you were running some real world code. I would expect that to blow up and fail early.
00:15:00 <mjg> i'm running *real* sizes, in the order they were obtained in a real workload, in a loop against various memsets
00:15:00 <mjg> and checking total time
00:23:00 <mrvn> ahh. That won't have the same cache behavior though
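The replay method mjg describes (real sizes, in their recorded order, looped against each memset variant, with total time compared) can be sketched in C as below. This is only an illustration under assumptions: `replay_sizes`, `memset_fn` and `noop_memset` are hypothetical names, the timing uses C11 `timespec_get` rather than a cycle counter, and mjg's actual harness is not shown in the log. `noop_memset` plays the role of the -broken variant, so it measures call overhead alone. As mrvn notes, a replay like this does not reproduce the cache behavior of the original workload.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Signature shared by every memset variant under test. */
typedef void *(*memset_fn)(void *dst, int c, size_t n);

/* Baseline that does nothing, to estimate pure call overhead. */
static void *noop_memset(void *dst, int c, size_t n)
{
    (void)c;
    (void)n;
    return dst;
}

/*
 * Replay `count` recorded sizes, in their original order, `rounds` times
 * against `fn`. `buf` must be at least as large as the biggest size.
 * Returns elapsed wall-clock nanoseconds (a cycle counter would be finer).
 */
static int64_t replay_sizes(memset_fn fn, const size_t *sizes, size_t count,
                            unsigned rounds, unsigned char *buf)
{
    struct timespec t0, t1;

    timespec_get(&t0, TIME_UTC);
    for (unsigned r = 0; r < rounds; r++)
        for (size_t i = 0; i < count; i++)
            fn(buf, 0, sizes[i]);
    timespec_get(&t1, TIME_UTC);

    return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000
         + (int64_t)(t1.tv_nsec - t0.tv_nsec);
}
```

Comparing `replay_sizes(noop_memset, ...)` with `replay_sizes(memset, ...)` on the same trace gives the kind of "function that does nothing" baseline quoted above.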
00:30:00 <zzo38> Consider to add B-Free into the list of abandoned projects. It is available on GitHub (and I have forked the project, but the only change I made is to add 64-bit types)
00:31:00 <mrvn> and someone should care about that why?
00:33:00 <zzo38> It is a FOSS implementation of BTRON. Such things are difficult to find, and someone who has interest in BTRON should hopefully try to improve it.
00:34:00 <mrvn> zzo38: so add a link to it on https://en.wikipedia.org/wiki/BTRON
00:35:00 <zzo38> Is it notable enough for Wikipedia?
00:37:00 <zzo38> Do you have further comments about my own operating system design? Just now, I added some more stuff to the design documentation.
00:39:00 <mrvn> If you have to ask for validation then you won't get it.
00:39:00 <zzo38> This system has a hypertext file system, capability-based security with proxy capabilities, locks and transactions on groups of objects as a unit, a common file format, and others.
00:40:00 <zzo38> I hope that if I had made any mistakes or unclear, to fix it.
00:41:00 <zzo38> I am also reading the forum to see if there is any interesting stuff mentioned in there
00:41:00 <heat> you should write it first
00:41:00 <heat> IMO not much point in designing what you haven't even tested
00:43:00 <zzo38> OK, it is a valid complaint, although first I want to write how I intend to design it, in the design documents, and then an implementation can be written and the parts of the design documentation changed as needed while finding some things are problems, but possibly some things can be found before implementation in case to make the implementation less messy
00:45:00 <mrvn> but if we tell you how will you ever learn?
00:45:00 <mrvn> what is a problem is also often subjective.
00:47:00 <zzo38> OK, but a collaborative design is also possible. It is true that what is a problem can be subjective, I suppose; I have found what seems to me to be problems with some other designs
00:47:00 <zzo38> There is also possibility of such things as unclear documentations, etc
00:54:00 <mrvn> s/possibility/inevitability/
00:58:00 <geist> problem with a lot of this stuff is when you're first getting started there's lots of things you dont know you dont know. so if you try to build some comprehensive design you'll be missing large sets of things that you really should be thinking about
00:58:00 <geist> so the general strategy that works is to start by doing, then as you do things you'll learn more of what you dont know
00:58:00 <geist> it gives you a framework to start putting more knowledge on top of
00:58:00 <geist> over time you start to get a better handle on all the things that do and dont matter with larger designs
00:59:00 <geist> this is the same reason i dont like to just info dump on new folks that come in and ask questions. you have to do it in phases so they can learn the framework to attach more knowledge to later
00:59:00 <mrvn> also when you start with the perfect design you will need years to get anywhere and 99.999% of people give up. So it's not really worth investing time in such people. Start small and learn.
01:00:00 <geist> otherwise you're just dumping info on someone. like many bad professors at university
01:02:00 <mrvn> and with that I bid you good night.
01:03:00 * geist is moving around the house with laptop, trying to find the best place to keep from coughing too hard
01:03:00 <geist> have had a shitty cold for the last 5 days, i think it peaked on friday with a fever, but now it's mostly just moving towards chest congestion
01:04:00 <geist> and i *hate* coughing
01:11:00 <heat> i feel you
01:12:00 <heat> was also under the weather a week ago
01:32:00 <zzo38> I think that it is worth to design both low-level and high-level stuff, and such thing also should be implemented in wiki
04:41:00 <Jari--> morning all
04:56:00 <geist> morn
07:01:00 <klys> user: mdasoh hostname: whatever model: gigabyte mz72-hb0 v3.0 distro: debian unstable/sid gnu/linux amd64 kernel: linux 6.0.0 uptime: up 5 days, 4:24, 8 users, load average: 0.06, 0.15, 0.26 processes: 637 kvm: 2 running virtual machines window manager: openbox desktop environment: none shell: /bin/bash terminal: konsole 4:22.12.0-2 packages: 4248 temperature: not supported by kernel cpu: AMD
07:02:00 <klys> EPYC 7453 28-Core Processor @ 2750.095 MHz gpu: NVIDIA Corporation GM206 [GeForce GTX 960] (rev a1), reserved to vm running ms windows 7 professional 64-bit resolution: 1366x768 via magewell usb capture hdmi ram: 242958.6 free / 257617.4 total MiB ssd: 1154342732 free / 1377099228 total disk: 12692037120 free / 13371027980 total reserved to vm running debian testing/bookworm gnu/linux amd64
07:11:00 <geist> klys: grats
07:12:00 <klys> geist, sure, thx
08:14:00 <geist> well, got Lk kinda running on the vision five board. having trouble getting the uart to fire irqs, but will leave that to another day
08:14:00 <geist> sadly the way the plic is set up and the uart are completely undocumented on the JH7110 soc, but the device tree tells me they're pretty darn standard
08:15:00 <geist> so it's probably just some register i have to frob somewhere
08:27:00 <geist> oh i see what it is, it's the stupid hart 0 misaligning the contexts in the PLIC
08:39:00 <klys> w00t
08:40:00 <sham1> w00t w00t
08:40:00 <GeDaMo> wಠಠt :|
08:47:00 <geist> sweet. seems to be working fine now
08:48:00 <klys> so nothing is wrong?
08:48:00 <geist> nah just had the mappings off. it's a stupid PLIC driver problem that i need to solve once and for all
08:49:00 <geist> i keep copying a PLIC driver for every riscv target and hacking it, because the mapping of cpu # (hart) to interrupt target (the plic's notion of a cpu) is not uniform
08:49:00 <geist> annoyingly so. the vf2 has the same 'hart 0 is a machine mode only cpu that isn't started by default' thing
08:50:00 <geist> so it uses up just one interrupt target on the plic, but all the other harts (1 - 4) use two
08:50:00 <geist> so the interrupt targets are offset by one
08:50:00 <geist> because for every cpu in the PLIC that has both M and S mode, it has two banks of registers, but only one bank for cpu 0
08:50:00 <geist> basically.
08:50:00 <geist> the device tree describes this which is the Real Solution to it
08:51:00 <geist> anyway when i copy pasted it i didn't fix it up properly here
08:52:00 <geist> i knew it was a problem but i thought i had tweaked it, but i insufficiently did
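The hart-to-context offset geist describes can be written down as a small mapping function. The sketch below assumes the layout from the conversation: hart 0 has a single machine-mode context, and harts 1 and up each have a machine-mode and a supervisor-mode context, in that order. The names (`plic_context`, `plic_enable_offset`, and so on) are hypothetical and not taken from LK. The register offsets follow the standard RISC-V PLIC memory map; as geist says, a real driver would read the mapping from the device tree instead of hard-coding it.

```c
#include <stdint.h>

enum plic_mode { PLIC_MODE_M = 0, PLIC_MODE_S = 1 };

/*
 * Map (hart, mode) to a PLIC context index, or -1 if no such context
 * exists. Assumption: hart 0 is M-mode only (one context); every other
 * hart has two contexts, M first and then S.
 */
static int plic_context(unsigned hart, enum plic_mode mode)
{
    if (hart == 0)
        return mode == PLIC_MODE_M ? 0 : -1;
    return (int)(1 + 2 * (hart - 1) + (unsigned)mode);
}

/* Offsets from the PLIC base, per the standard PLIC memory map. */
#define PLIC_ENABLE_BASE     0x002000u
#define PLIC_ENABLE_STRIDE   0x80u
#define PLIC_CONTEXT_BASE    0x200000u
#define PLIC_CONTEXT_STRIDE  0x1000u

/* Offset of the 32-bit enable word holding the bit for `irq`. */
static uint32_t plic_enable_offset(unsigned context, unsigned irq)
{
    return PLIC_ENABLE_BASE + context * PLIC_ENABLE_STRIDE + (irq / 32) * 4;
}

/* Offset of the priority threshold register for a context. */
static uint32_t plic_threshold_offset(unsigned context)
{
    return PLIC_CONTEXT_BASE + context * PLIC_CONTEXT_STRIDE;
}

/* Offset of the claim/complete register for a context. */
static uint32_t plic_claim_offset(unsigned context)
{
    return plic_threshold_offset(context) + 4;
}
```

With this layout, supervisor mode on hart 1 is context 2 rather than context 1, which is the off-by-one that a copied driver gets wrong when it assumes two contexts per hart starting at hart 0.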
08:53:00 <klys> that sounds p.awesome, so you have visionfive support...
08:53:00 <geist> yah still need to add the secondary bringup code
08:53:00 <geist> but i'll do that tomorrow
08:54:00 <geist> it's not all that exciting to be honest, i had done bringup on a proper sifive unleashed board a long time ago, and this is all in all pretty close
08:54:00 <geist> and have been running on qemu riscv which is also fairly close
08:54:00 <geist> but it's nice to see it running on physical hardware again. i gave away my unleashed board years ago so haven't had a good physical riscv rv64 machine to run on in a while
08:54:00 <GeDaMo> This? https://www.starfivetech.com/en/site/new_details/976
08:54:00 <bslsk05> www.starfivetech.com: News details page
08:54:00 <GeDaMo> "RISC-V SBC VisionFive 2 Officially Shipped"
08:55:00 <geist> yep. got mine last week. it's actually pretty decent
08:55:00 <GeDaMo> Neat
08:55:00 <geist> i mean not really worth the money from a performance point of view, but it's usable
08:55:00 <geist> for $100 you get a half decent riscv machine (think cortex-a53) with 2 ethernet ports and 8GB ram.
08:56:00 <geist> i wouldn't try running any UIs on it though. i think the gpu stuff is in terribad shape
08:56:00 <geist> but as a headless linux box or something to hack on it's pretty straightforward
09:16:00 <zid> is there a complicated german word for "I just found the best meme but it requires knowledge of a very specific thing to be understandable and I am now sad"
09:18:00 <dminuoso> Memekontextverständnissmangeltrauer
09:18:00 <dminuoso> Oh, drop one of the s in the middle
09:19:00 <dminuoso> Memekontextverständnismangeltrauer
09:19:00 <dminuoso> zid: does that work for you?
09:19:00 <zid> Thanks dminuoso
09:20:00 <dminuoso> Replace trauer with kummer, that sounds a bit cuter
09:20:00 <dminuoso> Memekontextverständnismangelkummer
09:20:00 <dminuoso> Yes that.
16:37:00 <mjg> CMake Warning:
16:37:00 <mjg> Manually-specified variables were not used by the project:
16:37:00 <mjg> LIBC_BUILD_AUTOMEMCPY
16:37:00 <mjg> there goes https://github.com/llvm/llvm-project/tree/main/libc/benchmarks/automemcpy
16:37:00 <bslsk05> github.com: llvm-project/libc/benchmarks/automemcpy at main · llvm/llvm-project · GitHub
16:48:00 <mrvn> mjg: so now you have the proven best memcpy?
16:49:00 <mjg> no, now i have complaints about the usual problem: a howto not working
16:50:00 <mrvn> Does it output source code one can still read?
16:51:00 <mjg> the machinery to autogen stuff does not build, see above
17:18:00 <heat> mjg, omg its our neighbourhood freebsd developer
17:18:00 <heat> hi!
17:18:00 <mjg> edited some cmakes, got the entire thing to try to build but it fails
17:18:00 <heat> how tf are you building it
17:18:00 <mjg> > /libc/benchmarks/automemcpy/lib/Implementations.cpp:6:10: fatal error: 'src/string/memory_utils/elements.h' file not found
17:18:00 <mjg> there is no file named elements.h anywhere nor anything like it
17:18:00 <heat> also btw pcm (PECIMAL) fixes got merged
17:19:00 <mjg> heat: dude i literally tried to follow the instruction
17:19:00 <heat> i have an elements.h in my local tree
17:19:00 <mjg> what's your version
17:19:00 <heat> old, c9faea04b1f8ef658ee5367ba8f00266b2051263, dated may 6
17:19:00 <mjg> let's try
17:19:00 <mjg> maybe it is autogened?
17:20:00 <heat> no
17:20:00 <heat> it's in ./libc/src/string/memory_utils/elements.h
17:20:00 <heat> (and I don't do in-tree builds)
17:21:00 <mjg> i confirm the file is there if i go back to that commit
17:21:00 <mjg> see top of main
17:22:00 <mjg> commit 534f4bca58f856eaecfcf4a698e7e6b2470349e4
17:22:00 <mjg> Author: Guillaume Chatelet <gchatelet@google.com>
17:22:00 <mjg> Date: Tue Oct 25 11:09:59 2022 +0000
17:22:00 <mjg> [libc] remove mem functions dead code
17:22:00 <mjg> whacked in this commit
17:22:00 <heat> WHACKED
17:23:00 <heat> ok now that we finally tracked down the issue
17:23:00 <heat> why is spinlock_enter, etc not inline?
17:23:00 <mjg> geezer
17:23:00 <heat> I touched spinlock stuff last evening
17:23:00 <mjg> the code is pretty atrocious so it makes sense to not be inline
17:23:00 <heat> let me gist you this shit
17:24:00 <heat> i'm not convinced that the codegen is that good
17:24:00 <heat> after looking at linux, I'm probably missing some stuff
17:25:00 <mjg> CMake Error at /root/repos/llvm-project/libc/CMakeLists.txt:116 (message): entrypoints.txt file for the target platform 'freebsd/x86_64' not found.
17:25:00 <mjg> sigh
17:26:00 <mjg> i'm done with this for the day
17:26:00 <heat> mjg, https://gist.github.com/heatd/76fa59893a76cc38a60962a9c67cf4b2
17:26:00 <bslsk05> gist.github.com: module_add_disasm.S · GitHub
17:26:00 <heat> tell me what you think
17:27:00 <heat> module_add is a pretty trivial hand rolled single linked list insert with a spinlock over it (which is why i'm using it as a codegen "benchmark")
17:28:00 <heat> doing dec and then re-fetching the value + test is pretty weird but I don't have condition code fuckery like linux does for this (yet)
17:29:00 <heat> the pushf; pop; test is also bad
17:29:00 <mjg> your preemption code remains total crap
17:30:00 <mjg> freebsd, with all its flaws, can do one branch and that's it
17:30:00 <heat> ideally you could just use the preemption counter as a way to gauge if I need to/can reschedule
17:30:00 <mjg> you don't want to check if you can reschedule, normally you have to assume the counter goes to 0
17:31:00 <mjg> what you want is to mark somehow that shit needs to be done
17:31:00 <mjg> and check for that
17:31:00 <mjg> the easiest way is to have another var
17:32:00 <mjg> you would get a failing grade in autopreempt
17:36:00 <heat> so linux stuffs a bunch of crap in preempt_count
17:36:00 <heat> the actual preempt count is a tiny portion of the whole var
17:37:00 <heat> the idea would be to stick "thread needs preempt" and "needs softirq" in the top bits
17:37:00 <heat> and even then, I'm not sure I need a "needs softirq" in preempt_enable code
17:38:00 <heat> because softirqs should only be raised in hardirq context, and if so there's only one exit point
17:39:00 <heat> the test for irq-on I'm not too sure I need as well
17:40:00 <heat> like, having a spin_lock() and then a spin_lock_irqsave is valid, and nothing will malfunction in this case. having a spin_lock_irqsave and then a spin_lock is not valid
17:40:00 <heat> right?
17:46:00 <mjg> simplifying but yes
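The idea discussed above, keeping the preempt-enable fast path to a single branch by folding a "reschedule needed" flag into the preempt counter, can be sketched as follows. This is a minimal single-CPU illustration, not heat's, FreeBSD's or Linux's code. All names are hypothetical. The inverted flag (bit set means nothing is pending) follows the approach Linux uses on x86, to the best of my understanding. A real kernel would keep the counter per-CPU and would need to consider interrupts modifying it.

```c
#include <stdint.h>

/* Top bit SET means "no reschedule pending" (inverted flag). */
#define PREEMPT_NO_RESCHED 0x80000000u

/* Would be a per-CPU variable in a real kernel. */
static uint32_t preempt_count = PREEMPT_NO_RESCHED;
static unsigned resched_calls; /* instrumentation for this sketch only */

/* Stand-in for the scheduler: count the call, clear the pending state. */
static void do_resched(void)
{
    resched_calls++;
    preempt_count |= PREEMPT_NO_RESCHED;
}

static inline void preempt_disable(void)
{
    preempt_count++;
}

/* Called, for example, from the timer interrupt when the slice expires. */
static inline void set_need_resched(void)
{
    preempt_count &= ~PREEMPT_NO_RESCHED;
}

/*
 * One decrement and one branch: the word is zero only when the nesting
 * depth is zero AND the inverted flag has been cleared, that is, when
 * preemption is allowed and a reschedule was requested.
 */
static inline void preempt_enable(void)
{
    if (--preempt_count == 0)
        do_resched();
}
```

Because the flag is inverted, the common case (nothing pending) leaves the top bit set, so the decrement never reaches zero and the branch is not taken.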
17:47:00 <mjg> sigh!
17:47:00 <mjg> while i don't have the auto thing operational, it got far enough to generate Implementations.cpp
17:48:00 <mjg> and this is where i'm deeply disappointed with the paper
17:48:00 <mjg> *every* *single* *one* implementation rolls with increasing size checks
17:48:00 <mrvn> where are the primitives it uses to build a memcpy?
17:48:00 <heat> their whole premise is very far-fetched anyway
17:48:00 <mjg> not even one rolls with the approach found elsewhere
17:48:00 <heat> "lets write a memcpy in C++"
17:49:00 <mjg> it is definitely way less than i thought it would be
17:49:00 <heat> i am very much not convinced that writing a memcpy in C++ is feasible
17:49:00 <mrvn> heat: why not?
17:49:00 <mjg> i checked the asm, it is ok, provided the code is not doing the stupid
17:49:00 <mjg> for example if(size < 8) return splat_set<HeadTail<_4>>(dst, 0, size);
17:49:00 <heat> you can't legally do overlapping stores
17:50:00 <mjg> compiles to an ok overlapping store
17:50:00 <heat> how?
17:50:00 <mjg> it *does* compile to it
17:50:00 <mjg> don't ask me how, i don't speak c++
17:50:00 <heat> how could you even do it in C?
17:50:00 <mjg> in c i don't think you could
17:50:00 <mrvn> your code will only ever be as good as your compiler. but they are pretty good.
17:50:00 <heat> well C++ is the same shit
17:50:00 <mjg> aha!
17:50:00 <mjg> builtin::Memset
17:50:00 <heat> unless __attribute__((aligned(1))?
17:50:00 <mjg> et al
17:51:00 <mjg> so there is an explicit compiler support for it
17:51:00 <heat> no
17:51:00 <heat> builtin:: is not compiler stuff I guarantee you that
17:51:00 <heat> __builtin_ would be, as in C
17:51:00 <mjg> well then happy grepping
17:52:00 <mrvn> Note that gcc/clang understand and detect various forms of memset() and will replace the code with their own. So you might generate 3 different memset flavours that all just turn into the compilers memset.
17:52:00 <mjg> so i'm sayin when they do this:
17:52:00 <mjg> if (count < 8)
17:52:00 <mjg> return builtin::Memcpy<4>::head_tail(dst, src, count);
17:52:00 <mjg> it *does* compile to an overlapping store, the way one would write in asm
17:53:00 <mjg> and for smaller sizes they correctly handle partial register access
17:53:00 <mjg> as in, perhaps modulo adding or not adding alignment to jump targets, this can generate the code i would write by hand
17:54:00 <mjg> so far anyway
17:54:00 <heat> ok, found the builtin
17:54:00 <heat> __builtin_memcpy_inline
17:55:00 <mjg> compiler support at last!
17:55:00 <mjg> there is also a lot of magic concerning what kind of simd to use
17:55:00 <heat> https://godbolt.org/z/sdYEYo3n6
17:55:00 <bslsk05> godbolt.org: Compiler Explorer
17:55:00 <mjg> all my above comments were about gpr
17:56:00 <heat> i don't see how this is doing overlapping stores
17:56:00 <mjg> because you used sizes known at compilation time
17:56:00 <heat> those are the only available options
17:57:00 <heat> you can't give it a non-constexpr value
17:57:00 <mjg> take a page from the original
17:57:00 <mjg> if (count < 4) lolpunt();
17:57:00 <mjg> if (count < 8) __builtin_memcpy_inline(..., count);
17:58:00 <mjg> https://dpaste.com/FGJA37DP8
17:58:00 <bslsk05> dpaste.com <no title>
17:58:00 <mjg> just copy some statements from there and adjust
17:58:00 <mjg> you literally provided no size arg :O
17:58:00 <heat> yeah erm no
17:59:00 <zid> if(count < 4) { DEBUG("Ever heard of MOV bro?"); }
17:59:00 <heat> I can't get it to do overlapping stores
17:59:00 <heat> no idea if there's more magic involved with this shit
18:00:00 <mjg> return builtin::Memcpy<8>::head_tail(dst, src, count);
18:00:00 <mjg> this magic compiles to it
18:00:00 <mjg> presumably they massage it to some extent
18:01:00 <heat> oh
18:01:00 <heat> they actually do the arithmetic
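The head/tail trick under discussion covers every length from k to 2k bytes with two fixed-size k-byte moves that may overlap in the middle. The sketch below shows it for k = 4 in plain C, using `memcpy` with a constant size, which mainstream compilers typically lower to a single load and store. This is an illustration only. `copy_head_tail4` is a hypothetical name, and the llvm-libc code reaches the same shape through its own `builtin::Memcpy` wrappers and `__builtin_memcpy_inline`, which are not reproduced here.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy n bytes, where 4 <= n <= 8 and the buffers do not overlap.
 * The two 4-byte moves overlap each other whenever n < 8; the bytes in
 * the overlap are simply written twice with the same values.
 */
static void copy_head_tail4(unsigned char *dst, const unsigned char *src,
                            size_t n)
{
    memcpy(dst, src, 4);                 /* head: bytes [0, 4)     */
    memcpy(dst + n - 4, src + n - 4, 4); /* tail: bytes [n - 4, n) */
}
```

The "arithmetic" heat mentions is the `n - 4` offset: the size of each move is a compile-time constant, and only the position of the tail depends on the runtime length.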
18:02:00 <mjg> anyway the fucking autogen uses a retired variant
18:02:00 <mjg> will have to update
18:02:00 <bnchs> hi
18:02:00 <heat> hi
18:02:00 <mjg> hello there
18:02:00 <heat> general
18:02:00 <heat> kenobi
18:03:00 <bnchs> bye
18:03:00 <mjg> have you seen that star wars movie with subs translated from english to chinese and *back*?
18:03:00 <mjg> there is a fan made dub using them
18:03:00 <heat> https://godbolt.org/z/6cnTcv5rK
18:03:00 <bslsk05> godbolt.org: Compiler Explorer
18:03:00 <heat> there you go
18:03:00 <heat> this is cool
18:04:00 <mjg> close to cool
18:04:00 <mjg> this being 2 separate statements is a little iffy
18:04:00 <mrvn> mjg: shì de
18:04:00 <mjg> i would prefer a more explicit "this is a fucking overlapping store situation"
18:04:00 <mrvn> 是的
18:05:00 <mjg> from this to that range
18:05:00 <heat> well this is supposed to be used by people who are bad at programming
18:05:00 <heat> like llvm-libc devs
18:05:00 <mjg> lemme find that star wars movie
18:06:00 <heat> fyi they like interleaved loads and stores
18:06:00 <mjg> https://www.youtube.com/watch?v=XziLNeFm1ok
18:06:00 <bslsk05> www.youtube.com: Star War The Third Gathers: Backstroke of the West HD (Dubbed) - YouTube
18:06:00 <mjg> heat: regular or simd
18:07:00 <heat> regular
18:07:00 <mjg> were.dat
18:07:00 <heat> let me write a gpr memcpy with this shit
18:07:00 <heat> give me 5
18:08:00 <mjg> try the movie since 24:00
18:08:00 <mjg> i'm not looking at any asm until you do mofer
18:11:00 <heat> omg it unrolled the hot loop
18:11:00 <heat> cry.jpeg
18:12:00 <mjg> loop?
18:12:00 <mjg> if you slap builtin::Memcpy<64>::head_tail(dst, src, count); with -mno-sse i strongly suspect it will be the *pessimal* fuckton of straight movs
18:12:00 <mjg> a'la unrolled
18:14:00 <heat> https://godbolt.org/z/MqeWTTxxo
18:14:00 <bslsk05> godbolt.org: Compiler Explorer
18:15:00 <heat> addq $32, %rdi
18:15:00 <heat> addq $-32, %rdx
18:15:00 <heat> addq $32, %rsi
18:15:00 <heat> cmpq $31, %rdx
18:15:00 <heat> ja .LBB0_2
18:15:00 <heat> testq %rdx, %rdx
18:15:00 <heat> jne .LBB0_4
18:15:00 <heat> jmp .LBB0_13
18:15:00 <heat> truly genius code
18:15:00 <heat> all bow to the all seeing compiler
18:16:00 <mjg> it does generate interleaved ops tho as you mentioned
18:16:00 <mjg> will add that to my bench matrix
18:16:00 <heat> erm remove the fno-omit-frame-pointer ofc
18:16:00 <mjg> it also goes highest to lowest
18:17:00 <heat> yep
18:17:00 <heat> fyi gcc does not have this
18:17:00 <mjg> your code is wrong though
18:17:00 <heat> where?
18:18:00 <mjg> len >= 32 does not have to be a multiple of 32
18:18:00 <heat> hm?
18:18:00 <mjg> oh it is fine, i misread
18:19:00 <mjg> it also generates decreasing switch so to speak
18:19:00 <mjg> which is what typical hand-coded routines are doing
18:20:00 <mjg> and which is the exact opposite of what was handcoded in the automemcpy case
18:20:00 <mjg> testing these cases upfront vs what i have now is what i wanted to do anyway
18:20:00 <heat> i mean, I did explicitly make it decreasing
18:20:00 <mjg> now i got even more reason
18:21:00 <mjg> oh wait, you added if (*likely*)
18:21:00 <mjg> this may be why
18:21:00 <mjg> let's drop all hints
18:21:00 <mjg> aaand no switch upfront
18:21:00 <mjg> :p
18:21:00 <mjg> welp i'm gonna test anyway, we will see what happens
18:22:00 <mjg> another gripe i have with the automemcpy thing is that they just randomized all their samples
18:22:00 <mjg> they have actual run of what really happened
18:22:00 <mjg> i found there are many cases where memset is called repeatedly for the same size
18:22:00 <mjg> which also means branch predictor is gonna learn
18:22:00 <mjg> and make it faster
18:22:00 <mjg> this goes away if you roll with random
18:23:00 <heat> to be fair I think it's interesting to see what happens if you write it non-optimally
18:23:00 <heat> what I wrote is pretty much a translation-ish of my memcpy
18:24:00 <heat> so what happens if you write a bad memcpy like llvm-libc
18:24:00 <mjg> it is not *bad*
18:24:00 <heat> just a bunch of heavily abstracted if cases or something
18:24:00 <heat> well, the way it is coded is
18:24:00 <mjg> wdym
18:24:00 <heat> it's just praying the compiler gets it right
18:25:00 <mjg> afaiu the compiler was modified specifically to generate precisely what they ask for
18:25:00 <mjg> so i don't think there is much praying going on here
18:25:00 <heat> __builtin_memcpy_inline is, yes
18:25:00 <heat> branching, etc? no
18:25:00 <heat> unless there's some "if __builtin_memcpy_inline is used, this is a memcpy function so lay things out like we want to"
18:26:00 <mjg> the generated asm is pretty clearly what they intended
18:26:00 <mjg> i agree in principle things may change and it can start compiling differently
18:26:00 <heat> shrug
18:27:00 <heat> I wrote my function in a very explicit "this is how I want it to be" but it completely ignored my suggestion
18:29:00 <mjg> i don't think it did
18:29:00 <mjg> you got the size switch upfront, as requested with 'likely'
18:29:00 <mjg> and the 32 byte loop was placed prior to other stores
18:30:00 <heat> I didn't ask for a switch
18:30:00 <heat> i wrote a cascading thing like me/freebsd/linux
18:31:00 <mjg> if you wanted a cascade you should have written it like automemcpy
18:31:00 <mjg> if (size > x)
18:31:00 <mjg> a bunch of times
18:32:00 <heat> ok so there's no control in this stuff
18:32:00 <heat> good to know
18:32:00 <heat> compiler giveth, compiler taketh away
18:33:00 <heat> "hurr durr bad codegen send a compiler patch" does not work for most people
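For reference, the "increasing size checks" shape that mjg attributes to the automemcpy output looks roughly like the sketch below: a chain of `if (n < x)` tests from the smallest class upward, each ending in a head/tail copy. This is an illustration under assumptions, not the generated code. `copy_upto_32` and `HEAD_TAIL` are hypothetical names, and whether a compiler keeps this branch order or turns it into the "switch upfront" seen in heat's godbolt output depends on the compiler and on any likely/unlikely hints.

```c
#include <stddef.h>
#include <string.h>

/* Copy n bytes (k <= n <= 2k) as two fixed-size, possibly overlapping moves. */
#define HEAD_TAIL(k, d, s, n)                              \
    do {                                                   \
        memcpy((d), (s), (k));                             \
        memcpy((d) + (n) - (k), (s) + (n) - (k), (k));     \
    } while (0)

/* Copy n <= 32 bytes between non-overlapping buffers, smallest class first. */
static void copy_upto_32(unsigned char *dst, const unsigned char *src,
                         size_t n)
{
    if (n == 0)
        return;
    if (n < 4) {
        /* 1..3 bytes: first, middle and last byte cover every case. */
        dst[0] = src[0];
        dst[n / 2] = src[n / 2];
        dst[n - 1] = src[n - 1];
        return;
    }
    if (n < 8) {
        HEAD_TAIL(4, dst, src, n);
        return;
    }
    if (n < 16) {
        HEAD_TAIL(8, dst, src, n);
        return;
    }
    HEAD_TAIL(16, dst, src, n); /* 16..32 bytes */
}
```

Reordering the tests (largest class first, or as a binary tree, as mrvn suggests benchmarking) changes only the dispatch, not the per-class copies, which makes the orders straightforward to compare in a replay-style benchmark.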
18:33:00 <mjg> now i remember why i did 32 byte read upfront
18:33:00 <mjg> it is so that when used as memmove i don't have to check for overlap
18:33:00 <heat> which branch?
18:33:00 <mrvn> If you have N checks of (size > x) have you considered benchmarking all the different orders for those tests and doing them linear or as tree?
18:34:00 <heat> all the <32 stuff you have does loads up front for that exact reason
18:34:00 <heat> you even commented that, pretty smart
18:35:00 <mjg> note none of these do the pointer comparison
18:35:00 <mjg> i would like to point out that for copyin/copyout use you can assume the user pointer is lower
18:36:00 <mjg> so perhaps codegen could reflect that
18:38:00 <mjg> i have no clue what impact, if any, this has
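The "loads up front" point can be shown in a few lines: if every byte of a small copy is read into temporaries before anything is stored, the routine is correct for overlapping buffers in either direction, so the same code serves as memmove with no pointer comparison. A minimal sketch for 8 to 16 bytes follows. `move_8_to_16` is a hypothetical name, and the routines being discussed are written in assembly and are not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Move n bytes, 8 <= n <= 16; dst and src may overlap arbitrarily.
 * Both 8-byte chunks are loaded before the first store, so no store
 * can clobber source bytes that are still to be read.
 */
static void move_8_to_16(void *dst, const void *src, size_t n)
{
    const unsigned char *s = src;
    unsigned char *d = dst;
    uint64_t head, tail;

    memcpy(&head, s, 8);         /* load bytes [0, 8)     */
    memcpy(&tail, s + n - 8, 8); /* load bytes [n - 8, n) */
    memcpy(d, &head, 8);         /* stores happen only after both loads */
    memcpy(d + n - 8, &tail, 8);
}
```

The same pattern extends to the 32-byte read mjg mentions by using four 8-byte temporaries (or two vector registers), as long as all loads precede all stores.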
19:06:00 <mjg> heat: https://github.com/scylladb/dpdk/blob/master/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
19:06:00 <bslsk05> github.com: dpdk/rte_memcpy.h at master · scylladb/dpdk · GitHub
19:09:00 <mjg> > memcpy() simply copies data one by one from one location to another. On the other hand memmove() copies the data first to an intermediate buffer, then from the buffer to destination.
19:09:00 <mjg> :d
19:11:00 <mjg> // Sample program to show that memmove() is better than
19:11:00 <mjg> // memcpy() when addresses overlap.
19:11:00 <mjg> :D
19:11:00 <mjg> too bad they don't have a chapter on writing memcpy itself
19:11:00 <mjg> ... they do
19:11:00 <mjg> fuck me
19:12:00 <mjg> > How to implement memmove()?
19:13:00 <mjg> // Create a temporary array to hold data of src
19:13:00 <mjg> char *temp = new char[n];
19:13:00 <mjg> ey heat, you can learn a thing or two here https://www.geeksforgeeks.org/write-memcpy/
19:13:00 <bslsk05> www.geeksforgeeks.org: Write your own memcpy() and memmove() - GeeksforGeeks
19:13:00 <mjg> fucking retarded
19:13:00 <mrvn> mjg: ouch, that segfaults due to stack overflow so fast you can't even blink
19:14:00 <gog> lmao watf
19:14:00 <zid> memmove isn't implementable in C without a compiler loophole to define it
19:15:00 <zid> which is fun
19:15:00 <mrvn> zid: huh? Sure it is.
19:15:00 <Ermine> gog: may I pet you
19:15:00 <gog> yes
19:15:00 * Ermine pets gog
19:15:00 * gog prr
19:16:00 <gog> it requires a check for overlap, and if the overlap is idk which way you have to do it head to base rather than base to head
19:16:00 <bnchs> mjg: bro, don't use geeks for geeks
19:16:00 <mjg> bnchs: what? why
19:16:00 <Ermine> depends on overlap kind
19:16:00 <mjg> look GREAT
19:17:00 <mrvn> The only problem is comparing 2 pointers that may or may not belong to the same allocation. So it's implementation defined.
19:17:00 <bnchs> it's slow, bloated, and has low-quality or copied answers
19:17:00 <gog> yeahhhh
19:17:00 <gog> aliasing rules
19:17:00 <mjg> bnchs: no wait mate
19:17:00 <mjg> bnchs: no way mate
19:17:00 <mrvn> bnchs: it also lies about the Auxiliary Space: O(1), it's O(n)
19:17:00 <mjg> how can this be bad if it is top google result
19:17:00 <mjg> basic logic mate
19:17:00 <Ermine> ah indeed, you cannot compare pointers...
19:17:00 <bnchs> because they pump their SEO
19:17:00 <bnchs> lol :3
19:18:00 <mrvn> gog: char * can always alias.
19:18:00 <mjg> but if it was bad people would not be using it
19:18:00 <mjg> i'm gonna take their memmove without credit
19:18:00 <Ermine> ... unless they belong to the same allocation...
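The direction check gog describes, without the temporary buffer from the quoted tutorial, can be sketched as below. The pointer comparison is the part mrvn flags: relational comparison of pointers into different objects is not defined by the C standard, so this sketch compares the addresses after conversion to `uintptr_t`, whose result is implementation-defined but behaves as expected on flat-address-space platforms. `byte_memmove` is a hypothetical name, and a production routine would copy in larger units.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-wise memmove: copy forward or backward depending on overlap. */
static void *byte_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if ((uintptr_t)d < (uintptr_t)s) {
        /* dst below src: copying base to head never overwrites unread bytes */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else if ((uintptr_t)d > (uintptr_t)s) {
        /* dst above src: copy head to base instead */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    /* d == s: nothing to do */
    return dst;
}
```

This needs O(1) auxiliary space, in contrast to the O(n) temporary array in the tutorial's version.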
19:23:00 <mjg> their shell is my zsh replacement
19:23:00 <mjg> https://www.geeksforgeeks.org/making-linux-shell-c/?ref=rp
19:26:00 <mrvn> If you want to write memcpy() in 100% correct C is there any other way but copying byte by byte?
19:27:00 <mrvn> You can't legally cast src/dst to short, int, long to get larger read/writes that I can see.
19:27:00 <mjg> fair q, i have no idea
19:30:00 * nikolar pets gog
21:13:00 <geist> mrvn: i dont *think* so, but then i think you might be technically forbidden from casting it to a char either
21:14:00 <geist> does make you wonder what memcpy looked like in environments where pointers were natively fat. ie segment:offset kinda stuff
21:14:00 <geist> or maybe in those environments they werent? ie, DOS large memory model
21:15:00 <mrvn> geist: you can cast anything to char. It's defined to give you the bit representation of an object.
21:16:00 <mrvn> char*
21:16:00 * geist nods
21:19:00 <mrvn> I guess you can do: uint16_t t = (((uint16_t)src[i]) << 8) | (src[i+1]); and hope the compiler optimizes that into a single load. But then you need to fix that for the hosts endianness and convince the compiler about alignment issues.
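mrvn's shift-and-or expression can be paired with a matching store to make a two-byte copy that touches memory only through `unsigned char`, which is always permitted. One observation, offered tentatively: when the load and the store use the same byte order, the copied bytes come out identical on any host, so host endianness affects only what instructions the compiler can emit, not correctness. The names below (`load16`, `store16`, `copy2`) are hypothetical, and whether a given compiler fuses each helper into a single 16-bit access is not guaranteed.

```c
#include <stdint.h>

/* Assemble a 16-bit value from two bytes, first byte most significant. */
static uint16_t load16(const unsigned char *p)
{
    return (uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}

/* Split a 16-bit value back into two bytes in the same order. */
static void store16(unsigned char *p, uint16_t v)
{
    p[0] = (unsigned char)(v >> 8);
    p[1] = (unsigned char)(v & 0xff);
}

/* Copy two bytes using only unsigned char accesses; no alignment needed. */
static void copy2(unsigned char *dst, const unsigned char *src)
{
    store16(dst, load16(src));
}
```

Since the accesses are byte-sized, there is no alignment requirement on `src` or `dst`; the cost is relying on the optimizer to recognize the pattern.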
22:08:00 <heat> mjg, oh my god that's a genius strategy
22:08:00 <heat> i'm surprised it's not in llvm-libc
22:09:00 <heat> hmm, can't really randomly allocate in the kernel can you
22:09:00 <heat> so in my case I will do the trivial optimization and allocate the temporary buffer on the stack
22:16:00 <mjg> right
22:16:00 <mjg> make sure to have multimegatybe stacks tho
22:16:00 <mjg> i mean you don't want limitations to memmove
22:17:00 <mjg> now, if they worked on automemmove
22:17:00 <mjg> perhaps they would stumble upon it
22:17:00 <mjg> lol @ folk 's what i'm sayin
22:18:00 <mrvn> Oh yeah, multimegatybe kernel stack for every thread. That will be so efficient.
22:20:00 <mjg> 1G stacks
22:20:00 <heat> real talk now im wondering if that can actually be faster than copying backwards
22:20:00 <mrvn> I thought we determined that the prefetcher will notice backwards sequential access and it won't be any slower
22:20:00 <mjg> ye i'm gonna 1. revisit the optimization docs 2. run some benchez once machinery gets operational
22:20:00 <heat> assuming you already have a temporary buffer, two forward copies vs one backwards
22:20:00 <mjg> brah
22:21:00 <mjg> openbsd way would be ONE buffer
22:21:00 <mjg> and all cpus just take turns doing memmove
22:21:00 <heat> one buffer and a bkl?
22:21:00 <heat> sgtm
22:21:00 <mjg> ok mjg@
22:22:00 <mjg> now i wonder, when i memset
22:22:00 <mjg> should i memset a temp buf and copy that?
22:22:00 <heat> geist, latest tianocore should support riscv OVMF
22:23:00 <mrvn> mjg: a buffer or a bunch of registers?
22:23:00 <mjg> what
22:23:00 <mjg> a buffer geeksforgeeks style
22:23:00 <mrvn> and you want to load that buffer into registers over and over?
22:23:00 <heat> obviously you use a global char protected by a mutex
22:23:00 <mrvn> or a buffer with the full size?
22:23:00 <heat> each iteration locks and unlocks
22:24:00 <mjg> each 1 byte access
22:25:00 * mrvn is still set on implementing memcpy/memset/memmove via DMA engine.
22:25:00 <mjg> now that i think of it i can wrap it in a zero cost abstraction for maximum performance
22:25:00 <heat> omg zero cost abstraction i write c++ i should know this omg omg
22:26:00 <mjg> here is a geezer story for you
22:26:00 <mjg> fucking guy claimed lock profiling has "sampling" implemented
22:26:00 <mjg> i tried to be nice and simply said it does not
22:26:00 <mjg> but he was adamant
22:27:00 <mjg> the "sampling" was incrementing *one* global var for every lock acquire
22:27:00 <mjg> so i enabled it
22:27:00 <mjg> perf went to shit so bad it was not even funny
22:27:00 <mjg> to his credit he conceded it perhaps does not work as intended
22:28:00 <heat> you like shit talking solaris so much but they literally gave you your favourite tool in the world, dtrace
22:29:00 <heat> if freebsd was in charge of profiling you would still be doing lock->lock_acq++;
22:29:00 <mjg> to solaris credit, not only they have a memset which does not just roll with rep
22:29:00 <mjg> but according to at least one comment they checked real world sizes which get passed in
22:30:00 <mjg> that's way above average right there
22:30:00 <heat> lol
22:31:00 <heat> actually, do you have any sort of real world kernel allocation profiles?
22:31:00 <mjg> you mean what sizes land in kmalloc et al?
22:31:00 <heat> yep
22:31:00 <mjg> i did, nothing i can refer to right now
22:32:00 <heat> i'm curious to see how those look. probably varies wildly based on kernel?
22:32:00 <mjg> however, freebsd is kind of nice here
22:32:00 <mjg> you can vmstat -mz any long running box and get the stats
22:32:00 <mjg> i can ask netflix to give some
22:32:00 <heat> oooooooooooh please do
22:32:00 <mjg> have you seen what vmstat -mz looks like?
22:32:00 <heat> no
22:33:00 <mjg> Type InUse MemUse Requests Size(s)
22:33:00 <mjg> tmpfs name 11 1K 2844 16,32,64,128
22:33:00 <mjg> tmpfs dir 11 1K 2805 64
22:33:00 <mjg> GEOM 765 124K 3842 16,32,64,128,256,512,1024,2048,8192,16384
22:33:00 <mjg> ITEM SIZE LIMIT USED FREE REQ FAILSLEEP XDOMAIN
22:33:00 <mjg> 2 Bucket: 32, 0, 2125, 75743, 432494, 0, 0,79248
22:33:00 <mjg> 4 Bucket: 48, 0, 4946, 51670, 188727, 0, 0,59944
22:33:00 <mjg> 8 Bucket: 80, 0, 6513, 22837, 817309, 107, 0,58590
22:33:00 <mjg> 16 Bucket: 144, 0, 2561, 12503, 115959, 0, 0,29264
22:33:00 <mjg> etc.
22:33:00 <heat> my slab accounting is totally broken because I don't have pcpu counters yet
22:34:00 <heat> oh yeah what freebsd userspace struct is 224-sized?
22:34:00 <mjg> struct-fucking-stat
22:34:00 <heat> aha
22:35:00 <heat> I stared into your bufsizes for a bit and I could tell that the results did change from memcpy -> copyin -> copyout
22:35:00 <heat> it's funny
22:35:00 <mjg> so one interesting bit to ponder is what kind of malloc buckets make sense
22:35:00 <mjg> solaris rolls with several multiples of 8
22:35:00 <heat> what workload did you run for bufsizes.txt?
22:35:00 <mjg> huge granularity
22:36:00 <mjg> at this point i don't remember mate :p
22:36:00 <mjg> probably building some shit
22:36:00 <mjg> i got some others
22:36:00 <mjg> including fresh ones from prod
22:37:00 <mjg> netflix guy says he is afk but can post stuff in 2h
22:37:00 <heat> is this warner losh
22:37:00 <mjg> no
22:37:00 <mrvn> mjg: I think the kernel basically always knows what size objects it needs and data should be kept local. So subsystems should make SLABs for specific sizes they need and you can optimize that far better.
22:37:00 <heat> half of freebsd is developed by him
22:37:00 <heat> ah ok
22:37:00 <heat> that guy and cristos zoulas are both fucking omnipresent
22:37:00 <mrvn> mjg: One thing that doesn't fall into that case though is the name lookup cache since file names are pretty variable.
22:39:00 <mjg> i can give you some bufsize.txt from netflix as well if the guy agrees
22:39:00 <mrvn> But even if you go with the malloc buckets then you know that all memory will be 8/16 byte aligned and have a size that's a multiple of 8/16. You can optimize that nicely too.
22:39:00 <heat> that would be most interesting, thanks
22:39:00 <mjg> linux has 8, 16, 32, 64, 96, 128, 192, 256, 512
22:39:00 <mjg> and then * 2 from there up to 8k
22:40:00 <mjg> i added 768 on freebsd to appease zfs which is doing tons of funny allocs
22:40:00 <mjg> total ram usage went down from it
22:40:00 <mjg> oops it was 384
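Why an extra bucket can lower total RAM usage is easy to see with a lookup over the size classes: a request is rounded up to the smallest bucket that fits, so anything between 257 and 384 bytes wastes far less in a 384 bucket than in a 512 one. The sketch below uses the Linux list as quoted above, plus a copy with 384 added purely for illustration; FreeBSD's real bucket table is not shown in the log, and `bucket_for` is a hypothetical name.

```c
#include <stddef.h>

/* Size classes as quoted in the conversation (doubling from 512 to 8k). */
static const size_t buckets_quoted[] = {
    8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
};

/* Same list with a 384 bucket added, for illustration only. */
static const size_t buckets_with_384[] = {
    8, 16, 32, 64, 96, 128, 192, 256, 384, 512, 1024, 2048, 4096, 8192
};

/*
 * Return the smallest bucket that fits `size`, or 0 if it exceeds the
 * largest bucket (a real allocator would fall back to page allocation).
 */
static size_t bucket_for(const size_t *tbl, size_t ntbl, size_t size)
{
    for (size_t i = 0; i < ntbl; i++)
        if (size <= tbl[i])
            return tbl[i];
    return 0;
}
```

A 300-byte allocation lands in the 512 bucket with the first table (212 bytes of slack) and in the 384 bucket with the second (84 bytes), which is the kind of saving that adds up when one consumer makes many allocations of that size.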
22:41:00 <heat> my only restriction right now is that sizes need to be 16 byte aligned
22:42:00 <mrvn> heat: you have a prev and next pointer in every free object?
22:45:00 <heat> mjg, 2 qs, 1) what do you think of the linux buddy page allocator 2) how does dtrace, etc effectively get data if you can't memory allocate? just pick a buf size?
22:45:00 <heat> s/memory allocate/allocate memory/g <-- yoda speak moment
22:46:00 <mjg> i don't know linux buddy page allocator
22:46:00 <mjg> dtrace has a bunch of safety measures to abort
22:46:00 <mjg> it preallocs bunch of shit and if it does not fit, you get told there are drops
22:46:00 <mjg> similarly, if it uses too much cpu, it decides something is way off and aborts tracing
22:46:00 <mjg> "systemic unresponsiveness" or so they call it
22:47:00 <mjg> see the -b parameter
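The policy mjg describes (preallocate, never allocate while tracing, and report drops when a record does not fit) can be illustrated with a fixed buffer and a drop counter. This is not DTrace's implementation, only a sketch of the policy; `trace_buf`, `trace_record` and the 64-byte capacity are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TRACE_BUF_SIZE 64 /* fixed up front; compare the -b option above */

struct trace_buf {
    unsigned char data[TRACE_BUF_SIZE]; /* preallocated storage        */
    size_t used;                        /* bytes consumed so far       */
    uint64_t drops;                     /* records that did not fit    */
};

/*
 * Append a record without allocating. If it does not fit, count a drop
 * and return false so the consumer can be told data was lost.
 */
static bool trace_record(struct trace_buf *tb, const void *rec, size_t len)
{
    if (len > TRACE_BUF_SIZE - tb->used) {
        tb->drops++;
        return false;
    }
    memcpy(tb->data + tb->used, rec, len);
    tb->used += len;
    return true;
}
```

A consumer draining the buffer would reset `used` and report `drops`, which matches the "you get told there are drops" behavior described above.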
22:54:00 <heat> man i need bpftrace
22:55:00 <heat> this is GREAT
23:27:00 <frkazoid333> https://arstechnica.com/information-technology/2023/03/unkillable-uefi-malware-bypassing-secure-boot-enabled-by-unpatchable-windows-flaw/
23:27:00 <bslsk05> arstechnica.com: Stealthy UEFI malware bypassing Secure Boot enabled by unpatchable Windows flaw | Ars Technica
23:43:00 <gog> i got the boots
23:44:00 <gog> and i got my gun from the pigs
23:46:00 * sakasama blinks at gog.