channel logs for 2004 - 2010 are archived at http://tunes.org/~nef/logs/old/ ·· can't be searched
#osdev2 = #osdev @ Libera from 23may2021 to present
#osdev @ OPN/FreeNode from 3apr2001 to 23may2021
all other channels are on OPN/FreeNode from 2004 to present
Sunday, 12 June 2022
00:18:00 <gog> i know a guy
00:18:00 <gog> i call him hraunmann cause hes got the lava
02:44:00 <mrvn> great, I finally know a guy who knows a guy.
02:50:00 <moon-child> did the guys you knew previously not know guys?
02:50:00 <mrvn> ask gog
02:51:00 <moon-child> hmm, is knowing somebody reflexive? Because if so...
03:05:00 <mrvn> bool knows(const Guy&, const Guy&) [[reflexive]];
03:34:00 <ckie> came here to paste a little journal entry from my project. might help any other new people who are lurking
03:34:00 <ckie> I think actually going for barebones x86_64 was stupid and a waste of time. Should've just started with the compiler, linked against SDL2 and went from there since I don't really care about where the SMP bit is on the intel core 2 uf02urtf or whatever. I wanna see it alive.
03:35:00 <ckie> .. yes i do know the wiki mentions this. i read it all, but i did not listen
03:54:00 <mrvn> ckie: what is 2 * 3 + 4?
03:54:00 <ckie> mrvn: i'm not a robot ya nerd
03:55:00 <ckie> (that is also how i interact with captchas. habit)
03:56:00 <ckie> last time i was here i asked about the segment registers because i was trying to change the gdt layout even though it was already valid
03:56:00 <mrvn> wow, that sentense has less word salat than your "journal entry".
03:57:00 <ckie> mrvn: well the word salad is because i wasn't excepting to actually show it to anyone except myself
03:57:00 <ckie> probably should've cleaned it up
03:57:00 * ckie is tired
03:59:00 <ckie> also using "ya nerd" as a derogatory was stupid. further evidence i should sleep instead of reading my code and placing ranty comments about how i did things stupidly
04:00:00 <zid> I also have absolutely 0 idea what the context is :P
04:01:00 <zid> just that you seem like you cleared some hurdle
04:01:00 <zid> in terms of a mistaken approach
04:02:00 <ckie> zid: yup, and i am very annoyed at past me for being so dumb with the previous design
04:02:00 <zid> whatever that may have been :P
04:03:00 <mrvn> zid: building a car was stupid for hime because he needs a blender so he can butter the pizza and go from there because who cares if the tires are rubber.
04:03:00 <ckie> when i'm not tired i churn out a ton of code and then when i get tired my thoughts slow down and i find the architectural problems in what i programmed
04:04:00 <mrvn> zid: scroll up 30m. That sentence makes about as much sense as his original one.
04:04:00 <ckie> s/his/their
04:05:00 <ckie> mrvn: the annoying part is i can't tell which sentences make sense and which don't
04:06:00 <ckie> i can only see how i basically wrote a giant bug
04:07:00 <mrvn> "Where we are going we don't need roads"
04:07:00 <ckie> the quotes are nice for my melted brain
04:07:00 <ckie> do continue
04:09:00 <ckie> also was 2*3+4 supposed to be evaluated with operator precedence
04:09:00 <ckie> s/$/\?/
04:10:00 <ckie> s/\?/?/
04:10:00 <ckie> i should really sleep
04:10:00 <ckie> i'm gonna do that
04:10:00 <ckie> mrvn, zid: thanks for enduring my brain puke. good night
05:11:00 <Mutabah> Ugh... why did I decide to implement EHCI
05:11:00 <Mutabah> The hardware interface is quirky and has complex documentation
05:12:00 <Mutabah> and then you get to the USB2/USB1 split transaction handling
05:12:00 <Mutabah> (tip: Run, run very far)
05:14:00 <mrvn> run silent, run deep
06:47:00 <Mutabah> For those who don't know: EHCI needs to know what USB2 hub (device ID and port) is used for any non-usb2 device you talk to
06:48:00 <Mutabah> And needs to know the speed of that device
07:08:00 <zid> Have you considered OHCI
07:51:00 <Mutabah> Oh, I've implemented that already
07:52:00 <Mutabah> I should just dump EHCI and go to XHCI
07:57:00 <zid> huh interesting
07:57:00 <zid> I guess technically qemu's q35 is OCHI, but I don't imagine they implement it?
07:58:00 <zid> wait I don't mean OCHI that's why
07:58:00 <zid> fuck
07:58:00 <zid> I meant UHCI
09:25:00 <Mutabah> ohci/uhci are USB 1 controllers
09:25:00 <Mutabah> ehci is USB2, and a pile of hack
09:25:00 <Mutabah> xhci is USB3+ and aparently is not too bad
12:03:00 <jafarlihi> Is it possible to fuzz OpenBSD running in KVM using syzkaller without creating another VM inside OpenBSD?
12:40:00 <heat> jafarlihi, yes but you need to change syzkaller
12:41:00 <heat> fuchsia and another system run in that mode
12:45:00 <jafarlihi> Is it worth it? That is, will I find bugs or is it fuzzed to death already?
12:49:00 <heat> fuzzed to death
12:49:00 <heat> https://syzkaller.appspot.com/openbsd
12:49:00 <bslsk05> syzkaller.appspot.com: syzbot
15:25:00 <ddevault> this kernel project is the most fun I've had programming in some time
15:27:00 <heat> welcome to the biggest timesink of the rest of your life
15:27:00 <ddevault> this is my second kernel and, uh, fifth huge project
15:28:00 <zid> weren't you raging yesterday about segment selectors? :P
15:29:00 <heat> it's part of the experience
15:32:00 <ddevault> oh, naturally there are frustrating parts
15:32:00 <ddevault> I did choose an implementation language which doesn't even have DWARF support
15:36:00 <ddevault> I am dreading ACPI though
15:37:00 <heat> import ACPICA
15:38:00 <ddevault> I will, eventually, in userspace
15:38:00 <ddevault> but I intend to defer it as long as possible
15:38:00 <heat> microkernel?
15:38:00 <ddevault> yeah
15:38:00 * heat cries in /unix
15:38:00 <ddevault> I currently don't have any provisions for running C code yet
15:38:00 <ddevault> so it will take a little bit of effort to set up C drivers like ACPICA
15:39:00 <zid> 'C drivers like acpica'?
15:39:00 <ddevault> ACPICA is, in fact, written in C
15:39:00 <heat> #fact
15:39:00 <zid> I mean sure, but
15:40:00 <ddevault> I will have to write up a little C library for interfacing with the syscall API and kernel services for it
15:40:00 <ddevault> not libc, thankfully
15:40:00 <zid> ah in that eense
16:17:00 <mrvn> ddevault: do you have inline asm?
16:18:00 <ddevault> no
16:19:00 <ddevault> we dpm
16:19:00 <ddevault> we don't* intend to add it, either
16:19:00 <ddevault> https://l.sr.ht/nmS5.png :D
16:20:00 <mrvn> You need some asm, no way around that. So better get started on that C interface.
16:20:00 <ddevault> we write assembly files separatley and link to them
16:20:00 <ddevault> works fine
16:20:00 <ddevault> would not have gotten this far without it
16:20:00 <mrvn> then why can't you link to C files the same way?
16:20:00 <ddevault> well, I could, but it would require some build system stuff
16:20:00 <ddevault> and in any case, I don't want to
16:21:00 <ddevault> ACPICA should be implemented as a standalone driver, not linked to Hare code imo
16:21:00 <mrvn> one entry to compile C source, big deal. :)
16:21:00 <ddevault> communicate over IPC
16:27:00 <j`ey> ddevault: does sr.ht not have search ability?
16:27:00 <ddevault> what are you searching for?
16:28:00 <j`ey> stuff inside a repo
16:28:00 <ddevault> no, not yet
16:31:00 <j`ey> ddevault: https://git.sr.ht/~sircmpwn/helios/tree/master/item/caps/%2Bx86_64.ha in this view it shows the commit '970eee3d'.. but that doesnt touch this file
16:31:00 <bslsk05> git.sr.ht: ~sircmpwn/helios: caps/+x86_64.ha - sourcehut git
16:31:00 <ddevault> yeah, that's just the latest commit on that branch
16:31:00 <j`ey> confusing (for github users at least..)
16:32:00 <ddevault> agreed
16:32:00 <j`ey> anyway, LAST = PAGE looks wrong?
16:32:00 <ddevault> good call
16:34:00 <zid> what's a .ha out of interest
16:34:00 <ddevault> https://harelang.org
16:34:00 <bslsk05> harelang.org: The Hare programming language
16:35:00 <zid> I don't like their example code so clearly it's a terrible language with no redeeming features
16:35:00 <j`ey> lol
16:35:00 <ddevault> I designed this language -_-
16:35:00 <zid> I wasn't talking about the design
16:36:00 <j`ey> buf: *[*]u8: uintptr: u64,
16:36:00 <j`ey> wut
16:36:00 <ddevault> context?
16:36:00 <j`ey> https://git.sr.ht/~sircmpwn/helios/tree/master/item/vulcan/rt/syscalls.ha#L19
16:36:00 <bslsk05> git.sr.ht: ~sircmpwn/helios: vulcan/rt/syscalls.ha - sourcehut git
16:36:00 <ddevault> yeah, this is not great
16:36:00 <ddevault> can skip the *[*]u8 cast
16:38:00 <zid> I still can't figure out what 0z means, the quick language guide has a section on inferred suffices but it isn't there, i u u8 and f32 are
16:38:00 <ddevault> it's size, which is equivalent to size_t
16:40:00 <mrvn> zid: follow the yellow brick road
16:41:00 <zid> surprised it isn't ssize_t but it makes sense to use z for it at least
16:42:00 <citrons> why ssize_t?
16:42:00 <zid> because he has a u for unsigned
16:42:00 <zid> and z is from the printf specifier for size_t %zu
16:42:00 <ddevault> z is from the "z" in "size"
16:42:00 <ddevault> which is also where printf gets it from
16:42:00 <zid> so z without u to me says "ssize_t"
16:42:00 <ddevault> hare does not link to libc, we do not need any of its baggage
16:42:00 <citrons> ah. well, ssize_t wouldn't be very useful for hare
16:42:00 <zid> maybe not
16:42:00 <mrvn> why do syscalls always return a pair of u64?
16:43:00 <zid> nor was I suggsting anything
16:43:00 <ddevault> efficiency, we have two return registers
16:43:00 <zid> I just said it was surprising
16:43:00 <citrons> sure
16:43:00 <mrvn> It should be i8, i16, i32, i64, iz, u8, u15, u32, u64, uz
16:43:00 <mrvn> jsut for consistentcy
16:44:00 <ddevault> we don't have a signed size type at all
16:44:00 <ddevault> something cannot have a negative size
16:44:00 <ddevault> it exists only as a hack for libc
16:44:00 <mrvn> ddevault: nothing for a ptrdiff_t?
16:44:00 <ddevault> not presently, no
16:44:00 <mrvn> Actually, is it i64 ot s64?
16:44:00 <ddevault> i64
16:45:00 <zid> Okay so the *actual* problem I have with the example, is that it seems to be throwing grammar at the screen in order to demonstrate it, *I hope*
16:45:00 <zid> Because if it hasn't done that this code is kind of crazy
16:45:00 <citrons> which example
16:45:00 <zid> the one.. ont he front page
16:45:00 <zid> that I got linked
16:45:00 <mrvn> 14 lines for hello world. hare sucks.
16:46:00 <ddevault> I would be more prepared to listen to your arguments if you had spend more than 90 seconds with the language
16:46:00 <ddevault> spent*
16:46:00 <zid> I wasn't talking about the language
16:46:00 <zid> for the nth time
16:46:00 <zid> You're getting defensive over something and I don't know what or why
16:46:00 <j`ey> then whats wrong with the example?
16:46:00 <mrvn> ddevault: no i++ or ++i in hare?
16:46:00 <zid> I said the example looks like it's using grammar just to use it, not that anything is wrong with it
16:46:00 <ddevault> mrvn: no, we haven't bothered but I'm not entirely opposed
16:47:00 <mrvn> ddevault: how about range-for?
16:47:00 <ddevault> no
16:47:00 <zid> if the z there is *required* to make it compile, then I'd maybe have something to say about the /language/ but I don't know anything about the language so I can't
16:47:00 <mrvn> generators?
16:47:00 <mrvn> list comprehension?
16:47:00 <ddevault> not as a first-class language feature
16:47:00 <ddevault> and also no
16:47:00 <ddevault> hare is closer to C than to anything else
16:47:00 <ddevault> knock it off, zid
16:47:00 <mrvn> are strings 0 terminated or length prefixed?
16:47:00 <zid> Literally no idea what you're talking about
16:47:00 <ddevault> neither
16:48:00 <ddevault> the length and data are stored separately
16:48:00 <mrvn> so mor like c++ then
16:48:00 <ddevault> a str type contains a length and a pointer
16:49:00 <mrvn> Fist problem I have with hare just form the front page: "const"? Why isn't that the default. Mutable should be the exception that you mark down.
16:49:00 <ddevault> there is no default
16:49:00 <ddevault> let and const are different keywords
16:50:00 <ddevault> it's not let const ... or let mut ...
16:50:00 <mrvn> and let is mutable and const is not?
16:50:00 <zid> heat: I love your reddit name :D
16:50:00 <heat> thanks
16:50:00 <heat> it's a great name that I CANT FUCKING CHANGE WHY CANT I CHANGE IT AHHHHHHHHHH
16:50:00 <ddevault> correct
16:51:00 <heat> zid, lots of doomers in that post btw
16:51:00 <mrvn> ddevault: so I could write: for (const i = 0z; ; )?
16:51:00 <ddevault> no, the minimum required for a for loop is the condition
16:51:00 <ddevault> you can write for (const i = 0z; true) though
16:52:00 <zid> that post being the nvidia gpu one?
16:52:00 <mrvn> ddevault: strange choice to use "let" there and not "var" or something
16:52:00 * ddevault shrugs
16:52:00 <ddevault> I like my bike sheds painted blue, how about you
16:52:00 <mrvn> ddevault: clashes with my tardis
16:52:00 <heat> zid, yes
16:52:00 <zid> I assume he meant to add "Given that the newer nvidia cards will be moving to firmware based drivers, do you think the DRI layer will get thinner, and if so, will it allow hobby OSs to possibly interface nvidia cards?"
16:52:00 <Jari--> Does anyone still use Objective C? I used to work on telecom company, doing GTK+/Objective C, Maemo, etc.
16:52:00 <zid> but he asked "CAN OS USE NVIDIA???/"
16:53:00 <citrons> I wish lua used `let` instead of `local`
16:53:00 <citrons> so much typing
16:53:00 <mrvn> Don't want to run from some monster and end up in my bike shed by accident
16:53:00 <zid> I wish lua had 'continue'
16:53:00 <citrons> it has `function` and not `fn` as well
16:53:00 <heat> zid, OS could always use nvidia
16:53:00 <heat> like, it has been done
16:53:00 <citrons> apparently `continue` is weird with lexical scoping
16:53:00 <mrvn> both lua and hare sucks, they have no "fun".
16:53:00 <heat> it's not easy, but you can do it and will be able to do it
16:54:00 <zid> I'm not aware of any hobby OS that implements say, accelerated DX though
16:54:00 <heat> zid, DX?
16:54:00 <zid> directx
16:54:00 <ddevault> DX is proprietary you dork
16:54:00 <ddevault> you mean GL or VK
16:54:00 <zid> no I don'#t
16:54:00 <zid> I mean DX
16:54:00 <ddevault> right
16:54:00 <Jari--> you get sued really well
16:54:00 <ddevault> good luck with that
16:54:00 <heat> you theoretically can if you use wine
16:54:00 <mrvn> Just compile wine for the OS and you have DX
16:54:00 <Jari--> whine
16:55:00 <citrons> Wine Is Not A Hobby OS
16:55:00 <heat> you'll never get sued if you implement DX
16:55:00 <zid> wine doesn't do it natively does it? afaik for older dx at least it was rewriting it as gl
16:55:00 <heat> obviously
16:55:00 * Jari-- is working on Commodore Basic virtual machine project
16:55:00 <zid> you can do native dx if you have a windows host and microsoft's balloon driver though
16:55:00 <zid> in a vm
16:55:00 <citrons> I'm sure reactos aspires to have an open source directx
16:55:00 <heat> it does
16:55:00 <heat> ...
16:55:00 <ddevault> reactos is largely based on wine
16:56:00 <zid> which is basically the solution you *might* be able to get away with, if nvidia's plan is set up to allow it to be easy
16:56:00 <ddevault> it will probably just use the same OpenGL translation layer
16:56:00 <citrons> probably
16:56:00 <zid> I've not seen the details of what nvidia's up to, other than "driver now lives in rom"
16:57:00 <mrvn> ddevault: when you specify a return type of ((u64, u64) | syserror) how does the code know which of the two it is?
16:57:00 <ddevault> it's a tagged union
16:57:00 <zid> how big the userspace portion would have to be to submit dx commands to that, but I was hopeful for "idk, less than before though!"
16:57:00 <ddevault> so it checks the tag
16:57:00 <heat> zid, probably still the same interface
16:57:00 <heat> so, probably still the same userspace portion
16:57:00 <mrvn> ddevault: I see no tag there. what if I want ((u64, u64) | (u64, u64) | syserror)?
16:58:00 <heat> as far as I understand it, the idea is to get an open source nvidia vulkan driver
16:58:00 <citrons> `(a | b)` is the annotation for a tagged union type
16:58:00 <j`ey> mrvn: the compiler adds the tag
16:58:00 <citrons> it's a language feature
16:58:00 <ddevault> mrvn: that collapses to ((u64, u64) | syserror)
16:58:00 <ddevault> the tags are implicitly assigned
16:58:00 <heat> then use zink to get opengl over vulkan
16:58:00 <mrvn> so you can't have 2 things in an union that have the same structure?
16:58:00 <heat> it wouldn't be a bad idea to get direct3D over vulkan as well
16:58:00 <mrvn> No explicit tags?
16:58:00 <ddevault> no explicit tags
16:58:00 <zid> hopefully the details are fun rather than boring
16:59:00 <ddevault> you can have two things with the same structure by defining new type aliases
16:59:00 <ddevault> new type, same storage and semantics
16:59:00 <Jari--> Level of Windows 95 support? ReactOS? Or is it 2000, 2003, etc. what could be relative Windows product in features it supports currently?
16:59:00 <zid> if all they're doing is literally using a BAR to implement mmap("nvidia.sys") then that's super super boring
16:59:00 <heat> zid, there's a whole API built around the firmware
16:59:00 <heat> which is RISCV btw, suck it zid
16:59:00 <heat> RISCV is a real architecture
17:00:00 <zid> riscv is free, and exists
17:00:00 <heat> facts
17:00:00 <zid> and nvidia own fabs
17:00:00 <Jari--> BSD386 FTW !
17:00:00 <zid> so it makes sense
17:00:00 <heat> what
17:00:00 <ddevault> I have a RISC-V machine right here :)
17:00:00 <heat> you mean 386BSD?
17:00:00 <ddevault> helios will be ported to it in the foreseeable future
17:00:00 <heat> what does that have to do with anything
17:00:00 <citrons> I'll have a riscv system someday
17:01:00 <heat> anyway, it's literally just a blob of code that the nvidia co-processor executes and you interface with the blob using an API
17:01:00 <heat> it's not mmap("nvidia.sys")
17:01:00 <zid> Yea I said I didn't know, I wasn't suggesting it was
17:01:00 <Jari--> I would prefer to connect and dock my personal hand phone (Android) to a desktop PC's monitor, keyboard and mouse...
17:01:00 <zid> mmap(nvidia.sys) is what *used* to happen
17:01:00 <zid> so you're back to front anyway
17:02:00 <heat> executing arbitrary code sounds like a good way to not get your driver signed
17:02:00 <zid> I said I hope they're not just going do do what used to happen via mmap, loading the 'installed nvidia driver into memory' via mmap, but now as a BAR, and that they're actually going to do cool things
17:02:00 <zid> It won't be a driver though it'll be firmware!
17:02:00 <heat> yes i know
17:02:00 <zid> so who needs to sign anything, muahaha
17:02:00 <ddevault> most drivers use firmware
17:02:00 <ddevault> similar designs have already shipped thousands of times
17:03:00 <mrvn> firmware -- code where other people have installed backdoors in
17:03:00 <heat> you need to have your driver signed else your driver won't load on secure booted machines
17:03:00 <zid> yes, again, not what I said
17:03:00 <ddevault> and firmware does not run on the host CPU
17:03:00 <ddevault> it runs on the device itself
17:03:00 <zid> WHy would they need to *submit* it for wqhl, if the point of this is to move the driver code into the firmware
17:04:00 <heat> submit the fw? they wouldn't
17:04:00 <zid> So why did you say what you said
17:04:00 <zid> I said they wouldn't need it signed
17:04:00 <heat> yes, the fw won't
17:04:00 <zid> you said "it needs to be signed though"
17:04:00 <heat> but the driver will
17:05:00 <zid> yes, but the driver won't change
17:05:00 <zid> they get it signed once then just keep doing 'firmware updates'
17:05:00 <ddevault> or just get new drivers signed like they do in a normal release cycle
17:05:00 <heat> at the first sight of "oh this runs native code on your native CPU fetched from firmware?" it would get blacklisted pretty quickly
17:05:00 <ddevault> that's *not* how it works
17:05:00 <heat> that's why kernels and bootloaders behave the way they do
17:06:00 <zid> and presumably why they have the risc-v'y bit
17:06:00 <heat> yes it is, any piece of code that can load arbitrary code and is signed will get blacklisted
17:06:00 <mrvn> if it doesn't require the firmware to be signed then it should get banned
17:07:00 <mrvn> ddevault: "firmware does not run on the host CPU". If it has access to memory, e.g. DMA, then that distinction is meaningless.
17:08:00 <ddevault> well, that's where something like IOMMU comes in
17:09:00 <mrvn> one can hope
17:54:00 <mrvn> does anyone have inline asm stubs for adcx and adox?
18:19:00 <moon-child> there are intrinsics
18:20:00 <mrvn> named what?
18:21:00 <mrvn> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79173 gcc still has a bug about it open
18:21:00 <bslsk05> gcc.gnu.org: 79173 – add-with-carry and subtract-with-borrow support (x86_64 and others)
18:21:00 <GeDaMo> https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=adcx
18:21:00 <bslsk05> www.intel.com: Intel® Intrinsics Guide
18:22:00 <moon-child> idk I saw them in the manual, don't use intrinsics much
18:26:00 <heat> heres a tip: they're hard to use in kernels
18:26:00 <heat> i.e you can't use SSE intrinsics in SSE-disabled code
18:26:00 <mrvn> <source>:8:9: error: '_addcarry_u64' was not declared in this scope
18:26:00 <heat> include the header
18:26:00 <heat> #include <x86intrin.h> ?
18:27:00 <heat> ah no, #include <immintrin.h>
18:28:00 <mrvn> both work on godbolt
18:29:00 <heat> what are you doing with those? internet checksum?
18:32:00 <mrvn> https://godbolt.org/z/3KaqMvE5P the intrinsic doesn't produce adox
18:32:00 <bslsk05> godbolt.org: Compiler Explorer
18:32:00 <mrvn> nor adcx for that matter
18:33:00 <mrvn> heat: big nums
18:33:00 <GeDaMo> Do you need to specify a machine to the compiler?
18:33:00 <heat> mrvn, it's addcarryx
18:33:00 <heat> you're using the adc intrinsic :)
18:34:00 <heat> hmm even then
18:36:00 <mrvn> both produce just adc
18:38:00 <heat> i can't get gcc or clang to gen those instructions
18:38:00 <heat> wtf
18:38:00 <mrvn> From what I saw in gcc bug reports gcc simply doesn't have the capability to track C and O chains for carry. It only has a flag that something modifies the CC flags.
18:39:00 <heat> icc does support it
18:39:00 <heat> just use icc, ez
18:39:00 <heat> you're probably better off using inline asm though
18:40:00 <mrvn> And what is wrong with clang here? https://godbolt.org/z/WKT1M7MYx
18:40:00 <bslsk05> godbolt.org: Compiler Explorer
18:41:00 <j`ey> restrict in the "wrong" place
18:42:00 <j`ey> "Big & __restrict__ a"
18:44:00 <mrvn> thx. And nox clang complains my target has no adx feature. How do I turn on cpu features in clang?
18:44:00 <mrvn> s/nox/now/
18:48:00 <GeDaMo> -madx is a thing https://clang.llvm.org/docs/ClangCommandLineReference.html
18:48:00 <bslsk05> clang.llvm.org: Clang command line argument reference — Clang 15.0.0git documentation
18:50:00 <mrvn> No adcx not adox with clang either.
18:58:00 <GeDaMo> https://github.com/microsoft/clang/blob/master/test/CodeGen/adx-builtins.c
18:58:00 <bslsk05> github.com: clang/adx-builtins.c at master · microsoft/clang · GitHub
19:03:00 <mrvn> GeDaMo: that generates adc, not adcx/adox
19:04:00 <mrvn> Also: uint64_t != unsigned long long which makes those macros horrible to use. Who came up with that?
19:39:00 <heat> if you're unhappy with clang you can always submit a patch and wait indefinitely for the patch to get merged without any feedback whatsoever
19:39:00 <heat> </butthurt>
19:40:00 <ddevault> how can I determine what regions of physical memory are used for mmio on x86, as distinguished from general purpose RAM?
19:40:00 <heat> ddevault, MMIO isn't marked as available on the memory map
19:40:00 <zid> you can't, really, but you could ask someone to tell you
19:40:00 <zid> like the e820
19:41:00 <ddevault> is it not more granular than "all physical addresses which are not marked as available"?
19:41:00 <heat> depends
19:41:00 <heat> on the EFI memory map? possibly
19:41:00 <zid> all physical addresses may or may not refer to a device
19:41:00 <mrvn> Is any region outside of some pci mapped device used for MMIO?
19:41:00 <zid> the cpu can't really tell
19:41:00 <heat> mrvn, local apic, io apic
19:41:00 <ddevault> alright
19:41:00 <heat> $chipset_stuff
19:41:00 <ddevault> I'll just assume all unavailable memory is potentially useful for devices
19:41:00 <heat> why do you care?
19:42:00 <zid> yea seems kind of inside out
19:42:00 <heat> any not-available memory is not available :)
19:42:00 <zid> normally you'd just collect up your system's information and use it, not try to rverse engineer it
19:42:00 <ddevault> distinguishing device memory from non-device memory for page allocation
19:42:00 <ddevault> microkernel, so userspace should be able to map device memory
19:42:00 <heat> <heat> any not-available memory is not available :)
19:42:00 <zid> finding out that 0xDEADBF0012 *doesn't* do anything is less useful than finding out which memory *is* useful
19:42:00 <geist> generally what you do is start off by figuring out what is memory. that's what e820/efi/etc tell you
19:43:00 <geist> so that's good, you know now that anything that's outside of that is potentially device memory
19:43:00 <heat> also, there's memory which is there but you can't touch it
19:43:00 <geist> finding device memory is then a case of going through bus specific mechanisms to discover or allocate
19:43:00 <ddevault> I assume if I map some bogus physical address into an address space and userspaces tries to use it, it will just create a fault (from which I can recover)
19:43:00 <heat> see ACPI NVS, EFI runtime services data, SMM data (which the chipset disallows to touch)
19:43:00 <mrvn> ddevault: no, it will just do something undefined
19:44:00 <ddevault> oh
19:44:00 <ddevault> well that's great
19:44:00 <mrvn> most likely ignore write and read 0
19:44:00 <heat> in x86 you usually get all-ones
19:44:00 <geist> ddevault: that's right. so you shouldn't allow that. or if something tries to map a random physical address there should be some amount of pre-determiniation that it's something valid
19:44:00 <ddevault> ah that's fine
19:44:00 <mrvn> or 1
19:44:00 <geist> ie, a PCI bus scanner that discovers all of the BARs and then only lets drivers map the bars
19:44:00 <ddevault> the issue, geist, is that the PCI driver is in userspace
19:44:00 <geist> sure. but you know where the starting point is from parsing ACPi/etc
19:45:00 <mrvn> ddevault: so? it should ask the PCI Scanner for mapped memory
19:45:00 <heat> let the PCI driver map everything and let it carve out parts of the address space for client drivers
19:45:00 <geist> so the bus driver starts by mapping the ECAM/etc, then scans the pci bus and as a result of that knows all the BARs
19:45:00 <geist> the trick is of course ACPI/etc but its not turtles all the way down. at some point you get to some root of truth where to start a search
19:45:00 <ddevault> yeah, and ideally it's rooted in userspace
19:46:00 <geist> yep. searching for ACPI is kinda annoying, but actually UEFI tells you where it is
19:46:00 <geist> or the root RSDP is in a known range of memory so you can map that and search it (the 640k hole)
19:46:00 <ddevault> eh, it's not that annoying
19:46:00 <geist> yah, exactly.
19:46:00 <ddevault> but it still doesn't semantically belong in the kernel
19:46:00 <ddevault> so if I can avoid it, I shall
19:47:00 <ddevault> (but I probably can't, because SMP)
19:47:00 <geist> but anyway the end result is that drivers shouldn't just willy nilly map things, or even better be *allowed* to map things
19:47:00 <heat> in fuchsia you get mmio as a vmo as well right?
19:47:00 <geist> for fuchsia, for example, only the root drivers have the ability to synthesize a physical memory object, so part of say a PCI driver startup, it's *handed* the objects it needs to map
19:48:00 <mrvn> ddevault: You should have some secure "token" to allow mapping memory ranges. And a owner of such a token can create new tokens for sub-ranges. Then you can start the PCI Bus Scanner with a token for the whole PCI memory range. That then creates sub tokens for the individual devices and gives them to each driver.
19:48:00 <geist> exactly. so the driver itself doesn't have any rights to just map something anywhere
19:48:00 <ddevault> aye, mrvn, I understand
19:48:00 <geist> the PCI bus driver has the necessary authority to construct the physical mappings on behalf of the drivers
19:48:00 <zid> and tbh, I'd just prefer that as an API anyway, "give me the mmio region for device 4" rather than "is address 37 good or bad so I can remap a pci device there?"
19:48:00 <geist> right
19:48:00 <ddevault> anyway, I am starting to conclude that PCI will probably have to live in the kernel, at least partially
19:48:00 <heat> no
19:48:00 <geist> it can be done in user space, it's just kinda messy
19:48:00 <mrvn> ddevault: no, it just has to ask the kernel to map the memory
19:49:00 <ddevault> yes, but the kernel has to determine if the physical address the user wants to map is sane
19:49:00 <geist> and there may be some necessary evil, especially as it pertains to interrupt mapping, where the pci driver has some special perms to talk to kernel space
19:49:00 <heat> no it doesn't
19:49:00 <heat> it *may*, but it doesn't
19:49:00 <geist> if you simply trust the pci driver to not be busted give it the authority to just synthesize mappings of any physical
19:49:00 <ddevault> well, it depends on what happens when you write to or read from an invalid physical address
19:49:00 <ddevault> hence the original question
19:49:00 <mrvn> ddevault: look at what I wrote before. The kernel only limits the PCI Bus Scanner to the PCI memory region. Everything else is user space.
19:49:00 <geist> you've distributed a bit of trust around, but the pci driver has a lot of authority
19:50:00 <ddevault> mrvn: yes, I understand
19:50:00 <heat> ddevault, anything that has the capability to map random physical memory should be trusted
19:50:00 <heat> it's the only way
19:50:00 <ddevault> my god
19:50:00 <ddevault> forget it
19:50:00 <heat> ok
19:50:00 <geist> oh? what's wrong? too many answers?
19:50:00 <mrvn> ddevault: you will need a similar abstraction for DMA
19:50:00 <geist> trying to be helpful
19:51:00 * geist queues up the Too Many Cooks youtube
19:51:00 <mrvn> geist: now I'm hungry
19:51:00 <geist> heh
19:54:00 <geist> okay so now in my new server board the cpu fan is turned sideways, which means it' generally blowing against the top of the case
19:54:00 <geist> i have another identical case with an actual fan mount there, so i guess i should swap the motherboard with it and get a top exhaust fan which should help
19:55:00 <geist> server definitely runs hotter under load. easily bumps past 80c and then the cooler fan spins up pretty loud
19:55:00 <heat> any fan is already miles better than a macbook's thermal design
19:55:00 <heat> yes yes I know I won't shut up about it, let me rant about macbooks while I have it
19:55:00 <geist> dunno have you used one of the new M1 macbooks? you gotta really work at it to get it to heat up
19:55:00 <heat> no I have the last intel macbook pro
19:55:00 <geist> though i haven't used an air, maybe they're a bit worse, though AFAIK they just dont have a fan
19:56:00 <geist> well if it makes you feel any better the M1s basically just dont spin the fan up, or if they do it's basically silent
19:56:00 <heat> the avg on-load temp is like 96C
19:56:00 <heat> right, I feel like that's the issue
19:56:00 <heat> make it spin pls
19:57:00 <mrvn> heat: if it's designed to run that hot what is the problem?
19:57:00 <geist> their new solution seems to just be 'make the cpu so efficient it doesn't need much cooling'
19:57:00 <heat> it deeply annoys me
19:57:00 <mrvn> the hotter the cpu the more efficient the cooling
19:57:00 <heat> also touching really hot aluminium isn't pleasant
19:57:00 <zid> It's never going to be good just because there's less laptop
19:57:00 <geist> i have a lenovo thinkpad 11th gen for fuchsia testing upstairs, and it almost instantly starts heating up under load
19:58:00 <mrvn> the aluminium isn't 96°C
19:58:00 <geist> and silly they put i think the intake vents on the bottom
19:58:00 <geist> so if you have it on your lap it really heats up fast
19:58:00 <heat> mrvn, it's not but it's still warm
19:58:00 <zid> if it were 2-3kg of laptop it'd heat up a LOT slower :p
19:58:00 <mrvn> heat: so what you really want is the fan to be controled by the cases outside temp.
19:59:00 <heat> if the CPU never reached 100C comfortably this just wouldn't be an issue
19:59:00 <zid> what you need to do is remove the thermal paste
19:59:00 <mrvn> it's not a laptop even if named thus, don't put it on your lap. :)
19:59:00 <zid> the cpu will throttle more and the body won't heat up as much
20:00:00 <zid> best way to increase testicle thermals
20:01:00 <heat> yeah right idk
20:01:00 <heat> it's only a laptop if you're not doing actual heavy work on it
20:01:00 <heat> and if you're not, why do you have a 2000 euro laptop
20:02:00 <heat> I should try an M1 one though
20:02:00 <heat> since that's massively better
20:02:00 <mrvn> I don't think they made any laptops in the last decade, only mobile systems wiht a screen+keyboard.
20:02:00 <mrvn> laptops are now called phones.
20:03:00 <heat> i can put my cheap laptop on my lap no problem
20:03:00 <mrvn> heat: plastic case?
20:03:00 <heat> yes
20:03:00 <heat> also hopefully the one I'm hopefully getting will not have these issues
20:03:00 <mrvn> heat: still surprises me. No heat vents on the bottom?
20:03:00 <heat> dell latitude 7420
20:04:00 <heat> mrvn, hrrrm
20:04:00 <heat> kinda?
20:04:00 <heat> it's usable-hot
20:04:00 <zid> heat: What about we water-cool your legs?
20:04:00 <zid> here's a 3L bottle of water, finish it quickly
20:04:00 <mrvn> Mostly you either block the vents or the 50°C air blowing out of it gets anoying.
20:04:00 <heat> zid, no liquid nitrogen?
20:05:00 <zid> Okay here's a genuine idea, tape an asbestos tile to the bottom
20:05:00 <mrvn> Every laptop should have a designated hot plate to place your coffe on.
20:05:00 <GeDaMo> Lick the tile first so it will stay in place :P
20:06:00 <zid> lick it all you want just don't take a hand-file to it and start huffing :p
20:06:00 <heat> my nuts look swollen and purple
20:06:00 <heat> you sure that's a genuine solution?
20:06:00 <zid> Your medical issues don't impact my solution's efficacy dw
20:07:00 <zid> You need a doctor not an engineer
20:07:00 <heat> ok wonderful
20:07:00 <heat> sgtm
20:07:00 <heat> just wanted to check
20:14:00 <mrvn> Does x86_64 have the opposite of setc/seto?
20:15:00 <GeDaMo> setnc/stno?
20:15:00 <GeDaMo> Er setno
20:16:00 <mrvn> I mean set the CC to what's in a register
20:16:00 <zid> sets the no flag, which causes all calls to fail
20:17:00 <zid> "call" "computer says no."
20:17:00 <GeDaMo> Not directly
20:18:00 <GeDaMo> neg al; add al, 1; maybe? Assuming al is 0 or 1
20:21:00 <mrvn> GeDaMo: neg already sets CF
20:21:00 <mrvn> The CF flag set to 0 if the source operand is 0; otherwise it is set to 1. The OF, SF, ZF, AF, and PF flags are set according to the result.
20:21:00 <mrvn> I need to set both CF and OF to specific values.
20:22:00 <GeDaMo> Even better! :P
20:22:00 <mrvn> looks like the only way for that is and extra adcx/adox .
20:23:00 <moon-child> mrvn: use the 'loop' instruction for your loops, then you don't have to save/restore c and o
20:23:00 <moon-child> :)
20:23:00 <moon-child> (don't actually do this)
20:23:00 <heat> USE IT
20:24:00 <heat> also use AAA and AAD
20:24:00 <zid> I wish
20:24:00 <heat> and enter + leave
20:24:00 * |Test_User complains about how loop is in it's own loop of reasons to not use - no one uses it so they didn't bother optomizing it, not optomized so no one uses it
20:24:00 <GeDaMo> You can push and pop the flags
20:24:00 <moon-child> leave is fine
20:24:00 <moon-child> enter is crap
20:24:00 <moon-child> pushf/popf are slow
20:24:00 <heat> pushf and popf aren't slow
20:24:00 <zid> heat: That's how I emulate DAA on SM83, I set up a 32bit compat segment selector, lgdt it, reload cs, do daa, then do the reverse
20:24:00 <mrvn> moon-child: Oh, I thought loop would still change cf. but that's even better
20:25:00 <heat> pushad and popad are slow though
20:25:00 <heat> are thus, great
20:25:00 <mrvn> heat: ASCII Adjust After Addition?
20:25:00 <heat> yes
20:25:00 <moon-child> heat: popf is 13 cycles
20:25:00 <moon-child> (on zen2, at any rate)
20:25:00 <heat> ok
20:25:00 <heat> but you need it
20:25:00 <heat> how about that
20:25:00 <mrvn> heat: whatever for should i use that?
20:26:00 <heat> idk i'm just listing my top 10 x86 instructions
20:26:00 <mrvn> My top instruction is sex
20:26:00 <heat> i've never used that
20:26:00 <moon-child> mrvn: you can instruct me any time you want
20:26:00 <mrvn> heat: oh, a virgin. :)
20:26:00 <heat> :D
20:27:00 <heat> that is also not a thing wtf
20:27:00 <heat> lying bastard
20:27:00 <GeDaMo> https://hbfs.wordpress.com/2008/08/05/branchless-equivalents-of-simple-functions/
20:27:00 <zid> my top 10 is a wildcard and mod/rm
20:27:00 <bslsk05> hbfs.wordpress.com: Branchless Equivalents of Simple Functions | Harder, Better, Faster, Stronger
20:27:00 <mrvn> sex == sign extend on some archs but x86 chickend out and calls that something else.
20:28:00 <Griwes> well if you go with sex then you also have to go with zex and that's just weird
20:28:00 <heat> movsx
20:28:00 <heat> and movzx
20:30:00 * Griwes is "looking forward to" writing his exception handling code that will need to sign-extend n*7-bit integers
20:30:00 <moon-child> not that hard, really
20:30:00 <moon-child> shift left, then right
20:31:00 <heat> Griwes, why?
20:31:00 <Griwes> heat, because C++ language specific exception tables use LEB128 to encode numbers
20:33:00 <klys> store and exchange
20:39:00 <geist> yah iirc SEX is one of the 6809 instructions, at least
20:40:00 <heat> Griwes, oh right, those exceptions
20:52:00 <mrvn> how do I pass a "const uint64_t *pb" in the "s" register to inline asm?
20:53:00 <mrvn> "rsi" register
20:53:00 <heat> https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints
20:53:00 <bslsk05> gcc.gnu.org: Machine Constraints (Using the GNU Compiler Collection (GCC))
20:53:00 <heat> good luck making sense of that
20:54:00 <heat> oh that's easy
20:54:00 <mrvn> never mind, it's "S", not "s"
20:54:00 <heat> the S constraint
20:54:00 <heat> it's not 's' because fuck you
20:58:00 <zid> The constraints for rdi, rax, rsi are SaD
20:58:00 <zid> makes mi cry every tim
20:58:00 <gog> alexa play despacito
20:59:00 <mrvn> that's because there is also a "d" register
20:59:00 <mrvn> a,b,c,d for int registers, D,S for pointers
21:47:00 <mrvn> How bad is it to go through a large uint64_t backwards instead of forward?
21:47:00 <mrvn> Has the predictive memory pre-fretcher advanced to the point where it doesn't matter?
21:48:00 <heat> i think the C copy backwards thing in that memcpy bench suite is a good bit slower
21:50:00 <mrvn> https://godbolt.org/z/f4T1vxG9f
21:50:00 <bslsk05> godbolt.org: Compiler Explorer
21:50:00 <moon-child> mrvn: afaik it's fine
21:50:00 <moon-child> I think they will even detect strided access
21:51:00 <mrvn> If I want to use loop then I kind of need to work barkwards through the array
21:51:00 <moon-child> so like if you're touching every other cache line, or every n
21:52:00 <moon-child> loop is slow though
21:52:00 <heat> s/slow/very fast/
21:52:00 <heat> there should be an llvm mode to use crap instructions
21:53:00 <heat> one better than -O0 that is
21:53:00 <mrvn> Is there a syntax for %rdi + 8*%rcx - 8?
21:53:00 <heat> yes?
21:53:00 <moon-child> gcc will do an actual division by constant in __attribute__((cold)) code
21:53:00 <moon-child> mrvn: I think-8(%rdi,%rcx,8), but idk att
21:53:00 <mrvn> moon-child: cold code is optimized for size
21:53:00 <heat> 8($rdi, 8, %rcx)
21:54:00 <moon-child> disp is -8, not 8
21:54:00 <moon-child> also rdi is %, not $
21:54:00 <heat> oops
21:54:00 <mrvn> https://godbolt.org/z/afYaEdcEW thx
21:54:00 <bslsk05> godbolt.org: Compiler Explorer
21:54:00 <heat> also oops
21:54:00 <heat> i'm writing asm in IRC give me a break
21:54:00 <heat> :P
21:54:00 * moon-child pats heat
21:54:00 <moon-child> there, there
21:54:00 <mrvn> hmm, that broke the result.
21:54:00 * heat mtrrs moon-child
21:55:00 <mrvn> https://godbolt.org/z/1sYMP8fW5 there, fixed
21:55:00 <bslsk05> godbolt.org: Compiler Explorer
21:55:00 <mrvn> I want %rdi - 8 * %rcx though :(
21:56:00 <moon-child> hmm
21:56:00 <moon-child> store your bigints in big endian
21:56:00 <heat> https://godbolt.org/z/xYG1zbns7
21:56:00 <bslsk05> godbolt.org: Compiler Explorer
21:57:00 <heat> compiler smart, know syntax
21:57:00 <moon-child> so did I, apparently
21:57:00 <heat> SO DID I
21:57:00 <heat> SUCK IT
21:57:00 <mrvn> moon-child: and how do you index that then with the loop variable?
21:57:00 <heat> you're very into the loop instruction
21:57:00 <moon-child> mrvn: then you just index with the loop variable
21:58:00 <moon-child> another option: don't index at all; instead, walk your pointers forward. use lea whatever,[whatever + 8] (lea doesn't set flags)
21:58:00 <heat> inc 8 times
21:58:00 <heat> manual add unrolling :P
21:58:00 <mrvn> moon-child: wait. I'm already doing big endian. I want litte endian
21:59:00 <mrvn> heat: inc changes CF iirc
22:00:00 <dminuoso> mrvn: Intel is documented to detect forward and backward strides in E.3.4.2 of Intel® 64 and IA-32 Architectures Optimization Reference Manual
22:00:00 <moon-child> agner sez of k8/k10 'Data streams can be prefetched automatically with positive or negative strides'
22:00:00 <moon-child> presumably applies to newer parts too
22:00:00 <dminuoso> Oh but wait
22:01:00 <dminuoso> That's for the IP prefetcher, not the DCU prefetcher
22:01:00 <mrvn> instructions cn be prefetched with negative stride?
22:01:00 <dminuoso> As far as documentation goes, only by the IP prefetcher
22:01:00 <dminuoso> So as long as you explicitly have load instructions
22:02:00 <mrvn> I mave mov (read), adox (read), mov (write)
22:02:00 <moon-child> dminuoso: 'explicitly have load instructions' as opposed to what, regular instructions w/memory operands?
22:04:00 <dminuoso> moon-child: Presumably, yes. The IP prefetcher, as far as I can read it, triggers on loads but only if a particular set of conditions are met.
22:04:00 <dminuoso> The DCU prefetcher seems to be triggered on just any meomry operand
22:05:00 <dminuoso> Its not clear whether DCU can work out negative strides, but the IP prefetcher is documented to support them (again but only under special circumstances)
22:05:00 <dminuoso> Well thats to L1 anyway
22:05:00 <dminuoso> There's also L2 prefetching
22:06:00 <dminuoso> For L2 prefetching it works in both directions
22:06:00 <dminuoso> Really just read the manual
22:06:00 <dminuoso> E.3.4.2 and E.3.4.3
22:12:00 <mrvn> Finally addition of 2 Big nums in parallel: https://godbolt.org/z/eWjhPd33j
22:12:00 <bslsk05> godbolt.org: Compiler Explorer
22:13:00 <heat> ok backwards copy doesn't seem to have a big impact on my CPU
22:13:00 <heat> i was wrong
22:14:00 <geist> yah i think pretty much any halfway modern design can prefetch reverse just as well
22:15:00 <geist> maybe it takes a bit longer to train it, possibly
22:15:00 <heat> seems to have a tiny, tiny impact (around 40MB/s)
22:15:00 <heat> might just be noise ofc
22:15:00 <geist> i also remember reading that the cortex-a53 can detect N strides in parallel, etc, and that's a pretty dumb machine. i think it's just a given nowadays
22:16:00 <mrvn> reading 4 Big nums and writing 2 back might throw it off
22:17:00 <mrvn> geist: it should at least detect 2 strides. Reading one and writing another is very common.
22:17:00 <mrvn> 3 strides is common too
22:29:00 <geist> yah
22:40:00 <mrvn> heat: how fast is it that 40MB/s is tiny?
22:54:00 <heat> 4000MB
22:54:00 <heat> /s
22:54:00 <heat> and this is relatively slow, just a laptop with a ULP kabylake R
22:55:00 <heat> https://gist.github.com/heatd/e83005662c837800fd5273934923a42b
22:55:00 <bslsk05> gist.github.com: gist:e83005662c837800fd5273934923a42b · GitHub
22:55:00 <mrvn> 1% then
23:10:00 <doug16k> neat to compare to 3950x with "slow" dual channel 2400 ECC memory with TSME enabled: https://gist.github.com/doug65536/bba271d46469e73790679ace14b6c408
23:10:00 <bslsk05> gist.github.com: 3950x, 2400 ECC, dual channel, TSME enabled · GitHub
23:10:00 <doug16k> it beat me at memset
23:11:00 <zid> my sandy gets 40GB/s on this with cheap ram if memory serves
23:15:00 <heat> doug16k, hey!
23:15:00 <heat> long time no see!
23:15:00 <doug16k> yeah
23:16:00 <doug16k> client says june 2021
23:16:00 <doug16k> sorry, july
23:18:00 <doug16k> I wish I could get 64GB 3200 ECC, but everyone has buffered unbuffered registered unregistered in the product details, to spam search indexes, so I can't find any
23:19:00 <zid> correct
23:19:00 <zid> finding ram is *impossible*
23:19:00 <zid> My favoure is "32GB ECC" "Showing results for 3x2GB non-ECC"
23:19:00 <mrvn> 3rd shelf, half way up in the box labeled RAM
23:20:00 <mrvn> pretty easy to find
23:20:00 <zid> Why can't people just properly mark things as UDIMM, etc :(
23:27:00 <doug16k> heat, were you saying your memcpy was 40MB/s? sounds uncached
23:28:00 <mrvn> doug16k: backwards is 40MB/s slower than forward
23:28:00 <doug16k> ah
23:32:00 <zid> oh I forgot I pulled some dimms when I was testing why my cpu was crashing, crap
23:32:00 <zid> someone remind me to fix that next time I say I am bored kthx
23:32:00 <doug16k> is the forward copy using big moves and backward one always uses bytes?
23:33:00 <mrvn> why would it?
23:33:00 <doug16k> because backward one might need to be byte
23:33:00 <doug16k> right?
23:33:00 <mrvn> we are talking about memcpy, not memmove
23:34:00 <doug16k> then why is it ever backward?
23:34:00 <mrvn> because then it can use loop and use rcx as index
23:34:00 <doug16k> you can already
23:35:00 <doug16k> all you have to do is point past the end of the memory and count from negative up to zero
23:35:00 <mrvn> loop is counting down, not up
23:37:00 <doug16k> if you point rdi to the *end* of the array, and negate the count in rcx, then you could access [rdi+rcx*4] and inc rcx jnz
23:37:00 <doug16k> is that what you mean by using rcx for index and count?
23:37:00 <mrvn> doug16k: you can. But that's not loop
23:37:00 <doug16k> why not?
23:37:00 <doug16k> ok loop then
23:37:00 <mrvn> Plus inc changes CF which brakes my case of looping adcx/adox
23:38:00 <doug16k> if you want it to take more cycles for nothing
23:38:00 <doug16k> ok use loop then*
23:38:00 <mrvn> doug16k: the question was wether it would be slower or not.
23:41:00 <doug16k> historically, loop has been intentionally slow
23:47:00 <mrvn> doug16k: it's strange. loop should be faster than everything else since it doesn't add any dependencies.
23:48:00 <doug16k> if the bigint were that big, I'd have at least 8 adc unrolled, then use setc at the bottom then cmp the something to get carry back?
23:48:00 <doug16k> ...at the top
23:48:00 <mrvn> doug16k: https://godbolt.org/z/nMsWeMnsn
23:49:00 <bslsk05> godbolt.org: Compiler Explorer
23:49:00 <mrvn> doug16k: The goal was to not have to save/restore any flags.
23:49:00 <doug16k> me too, that's why it didn't pushf popf, but I know what you mean
23:50:00 <doug16k> it will be free though, out of order will put that through for nothing
23:50:00 <doug16k> almost. it will kind of overlap with the loop overhead
23:50:00 <mrvn> If you use anything but "loop" then I think you have to setc, seto at the end and an extra adcx, adox at the start.
23:52:00 <doug16k> how big is the bigint. if it is so big this loop overhead matters, you might be bottlenecked on memory anyway
23:52:00 <mrvn> doug16k: anywhere from millions to 2 words.
23:53:00 <mrvn> is popf/pushf slower than adcx, adox, setc, seto?
23:54:00 <mrvn> +2 xor
23:54:00 <heat> you should profile things
23:54:00 <heat> also llvm-mca
23:55:00 <doug16k> mrvn, you unroll some, you don't hammer the setc seto for each one
23:55:00 <doug16k> needs to be like memcpy where it has large and small behaviour
23:56:00 <mrvn> doug16k: obviously. That's something to measure too. Can't unroll forever of the icache runs dry.
23:56:00 <doug16k> I mean 8
23:56:00 <doug16k> has to fit the uop cache
23:57:00 <mrvn> doug16k: It's 6 opcodes per iteration.
23:57:00 <doug16k> or more. just make it high enough that the setc seto disappears because it fit it through for nothing, overlapping the loop counter
23:57:00 <mrvn> 4 opcodes to advance the pointers, 6 opcodes for the flags.
23:58:00 <mrvn> If I unroll it for 2 iterations it's 12 opcodes + 10 opcodes overhead. That's probably slower.
23:59:00 <mrvn> 8x unroll would be 48+10
23:59:00 <doug16k> there are no opcodes once it predicts that branch at the bottom taken. it will stream already-decoded ops into the reorder buffer