brrt \o 07:02
nwc10 o/
jnthn So yesterday I implemented some "is this point in a branch relative to this one" thing as a quick heuristic check, but was thinking, hmm, this is ad-hoc, but there must be a proper way to do this... 09:10
Realized in the pub last night: I wanted postdominance.
Though not sure if it's worth calculating if all we really need it for is a heuristic. 09:11
lizmat
.oO( post dominance for a post modern language )
09:12
nwc10 pubs++
brrt what is postdominance? 09:58
timotimo probably very similar to the dominance we use to get the SSA form right, but instead of from the beginning towards the end it'd go from the end towards the beginning? 09:59
jnthn brrt: Dominance but with the arrows reversed
Or intuitively, "what do we always exit via" 10:00
(Where dominance is "what do we always enter via")
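Jnthn's "what do we always exit via" can be computed with the textbook iterative dataflow for dominators, just run over the reversed CFG. A minimal sketch (toy diamond CFG, not MoarVM code):

```python
# Postdominators are dominators of the reversed CFG: iterate
# "intersection over successors, plus self" to a fixed point,
# seeded from the single exit block.
def postdominators(succ, exit_block):
    blocks = list(succ)
    full = set(blocks)
    pdom = {b: {b} if b == exit_block else set(full) for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if b == exit_block:
                continue
            new = set(full)
            for s in succ[b]:
                new &= pdom[s]       # must hold on EVERY path to the exit
            new |= {b}               # a block always postdominates itself
            if new != pdom[b]:
                pdom[b] = new
                changed = True
    return pdom

# Diamond: 0 -> {1,2} -> 3 -> 4 (exit). Block 3 postdominates everything,
# which is exactly the "we always exit via 3" property the heuristic wants.
succ = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
pdom = postdominators(succ, 4)
print(pdom[0])   # {0, 3, 4}
```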
Woodi would be nice to have in precompiled modules, but if the runtime just needs heuristics, then it can be costly? 10:02
timotimo we only calculate those things for things that are measured to be hot during run time 10:06
calculating it for everything while precompiling modules is wasteful
especially since earlier optimizations can already throw out branches completely
jnthn Well, also probably what the VM optimizer considers a basic block is similar to what Perl 6 does anyway, especially given it has to be re-calc'd after inlines 10:07
brrt oh, I see 10:14
.... that would've been a nice term to have, I had that precise problem half a year ago 10:15
timotimo the MVM_op_infos array is about 35 kib big 11:45
i do believe we have a few fields that don't need to be 8 bits wide
timotimo tries bitfield support in C for this 11:46
28 kib, cool.
hmm. we have an MVMuint16 num_operands, but the maximum operand count we allow is 8; the higher values are just for the benefit of PHI nodes 11:47
hm, 28.6 kib actually 11:48
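The bitfield experiment above can be illustrated with ctypes, which lays structs out by the host C compiler's rules. The field names here are invented stand-ins, not MoarVM's actual op-info layout:

```python
import ctypes

# Hypothetical wide layout: a full 16 bits for num_operands even though
# its values never exceed 8.
class WideOpInfo(ctypes.Structure):
    _fields_ = [("opcode", ctypes.c_uint16),
                ("num_operands", ctypes.c_uint16),
                ("mark", ctypes.c_uint16),
                ("operands", ctypes.c_uint8 * 8)]

# Same information with bitfields: 0..8 fits in 4 bits, and a 1-bit flag
# can share the same byte.
class PackedOpInfo(ctypes.Structure):
    _fields_ = [("opcode", ctypes.c_uint16),
                ("num_operands", ctypes.c_uint8, 4),
                ("has_mark", ctypes.c_uint8, 1),
                ("operands", ctypes.c_uint8 * 8)]

wide, packed = ctypes.sizeof(WideOpInfo), ctypes.sizeof(PackedOpInfo)
print(wide, packed)
```

Multiplied across ~900 ops, a couple of bytes per entry is where savings of this order come from.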
having the operands in-line with the structs themselves means we pay 8 bytes for every op, which is 917 bytes 11:50
there's only 4 ops with 8 operands, 6 with 7, 14 with 6 11:53
silly me, it's not 917 bytes, it's 917 * 8 bytes 11:54
m: say 2360 / (917 * 8)
evalable6 0.321701
timotimo about 66% space savings available with a more packed operands storage 11:55
m: say 917 * 8
evalable6 7336
timotimo save 5 more kilobytes off these 28
that's my useless timewaste for the day, thank you very much 11:56
timotimo but if we see what spesh and the bytecode validator tend to need vs not need, we might get the tiniest bit of performance out of less cache contention 12:00
brrt Well, yes, spesh etc use the op_infos table quite heavily 12:05
so a saving would be nice :-)
timotimo oh, i have to add 917 * 2 back for the savings, because we'll obviously need an index into the ops list
(though actually that can re-use parts that repeat, which would be cool)
collect all operands bits, calculate overlaps, tile the shortest array with it, make update_ops.p6 take ten minutes to complete 12:07
i'm kind of sort of interested now
FWIW, tools/lib/MAST/Ops.pm has a setup much like this already :) 12:08
completely without overlap finding, though
brrt well, if you first make update_ops.p6 take ten minutes, we'll all have a challenge to reduce that to a couple seconds 12:11
timotimo :D 12:12
a simple greedy algorithm should be just fine and very fast
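The greedy idea being discussed (share one operands array; reuse substring matches and tail/head overlaps) can be sketched like this. The data is a toy, not real MoarVM operand descriptors:

```python
def pack(sequences):
    """Greedy shared-array packing: each sequence becomes an offset into
    one list, reusing full substring matches and tail/head overlaps."""
    packed = []
    offsets = []
    for seq in sequences:
        n = len(seq)
        # 1. already present as a substring: costs no extra entries
        pos = next((i for i in range(len(packed) - n + 1)
                    if packed[i:i + n] == seq), None)
        if pos is not None:
            offsets.append(pos)
            continue
        # 2. longest overlap of packed's tail with seq's head
        best = 0
        for k in range(min(n - 1, len(packed)), 0, -1):
            if packed[-k:] == seq[:k]:
                best = k
                break
        offsets.append(len(packed) - best)
        packed.extend(seq[best:])
    return packed, offsets

packed, offsets = pack([[1, 2, 3], [2, 3, 4], [1, 2]])
print(packed, offsets)   # [1, 2, 3, 4] [0, 1, 0]
```

Eight entries stored naively shrink to four here; the per-op cost becomes just the offset (the `917 * 2` index mentioned above).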
brrt so, you know what else I found
I expect that tiling will Just Work on a linear IR
timotimo oh? 12:13
that sounds good
no changes needed for that part of the code
brrt well, some changes
timotimo which i imagine is a difficult part of the code
like the rest of the code :)
brrt I expect it will be simpler
right now, we do a postorder (bottom-up) iteration, and then a top-down iteration 12:14
but all we really need bottom-up for, is to ensure that nodes are processed before the nodes that reference them; 12:15
that is easily ensured by the order of the linear IR
timotimo nice
sounds like a whole phase can be tossed
brrt and the inverse is true for top-down
not sure about the phase, but we can translate a tree traversal to a linear iteration, without a change in semantics 12:16
timotimo ah, sweet
brrt another thing that the tiler currently does, is create the basic blocks
that would be handled in an earlier phase
(and assigning labels) 12:17
because it translates between the graph form, and the linear form
so if we have a linear form throughout, we no longer need that translation (it'll happen in the perl compiler)
and finally. Because of the linear IR, I can ensure that any operand loads are emitted before the template that references them. This will eliminate entirely the missing-ancestor problem 12:18
and the optimizer can move them and eliminate them if it proves that this is unnecessary 12:19
I guess the point is that adding an explicit relative order of operations allows you to work on that relative order 12:20
which isn't possible in the graph IR as that leaves them unspecified
timotimo i've got a preliminary result 12:25
brrt I'm interested? 12:26
timotimo it's built a list of 722 elements
compare that to 917 * 8 elements for "no overlap attempts" 12:27
er
that's "not tightly packed"
and 2360 for "tightly packed"
that's not all yet
because when i find no match, currently i just append the whole thing to the end without checking for a tail overlap
m: say 722 / (917 * 8) 12:30
evalable6 0.098419
timotimo m: say 722 / (2360)
evalable6 0.305932
timotimo i had meant to work on something else today, but that's not a terrible win
m: say 722 - (917 * 8) 12:31
evalable6 -6614
timotimo so a bit more than 6kb savings for the operands array
timotimo if i don't "have to" put the opcodes in at the same order as they are in the input, i can potentially make better decisions 12:41
Cowardly refusing to permutate more than 20 elements, tried 821 12:44
%)
641 is my best score so far based on "just" reordering the ops randomly 12:48
reproducible builds this is not 12:49
except i srand($attempt-number)
632, cool
timotimo and it parallelizes trivially, of course 12:53
oh, hm, but random state is per thread context, and if any of the blocks had an await in it or were otherwise descheduled, they would resume with different random values
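The `srand($attempt-number)` trick can be sketched as follows: each attempt gets its own generator seeded by the attempt number, so the best result is reproducible from the winning seed alone, and since no attempt touches shared random state, the search also avoids the per-thread-context problem just mentioned. The score function is a toy stand-in:

```python
import random

def best_order(items, score, attempts=200):
    """Try `attempts` seeded shuffles; keep the best-scoring order."""
    best_score, best_seed, best_items = score(items), None, list(items)
    for seed in range(attempts):
        candidate = list(items)
        random.Random(seed).shuffle(candidate)   # deterministic per seed
        s = score(candidate)
        if s < best_score:
            best_score, best_seed, best_items = s, seed, candidate
    return best_score, best_seed, best_items

# Toy score: number of adjacent unequal entries (lower = better grouped).
score = lambda xs: sum(a != b for a, b in zip(xs, xs[1:]))
result = best_order([1, 2, 1, 2, 1, 2], score)
print(result)
```

Recording only the winning seed is enough to rebuild the exact order later, which is what makes the build reproducible.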
timotimo oh, throwing out all the 1-parameter ops at the beginning could be interesting; there's 117 of them, and 12 0-parameter ops, too 12:56
they can just be put in at the end. either they overlap somewhere or they'll go in at the end anyway, but they shouldn't cause a potential chain to break by being appended at the end 12:57
timotimo i can't find the name of this problem; it should totally be a common comp-sci problem to solve 13:02
timotimo oh! 13:21
write registers
two opcodes that have a write register can only have overlapping beginnings 13:22
only all-read-operand opcodes can be partially inside or overlapping beginning and end of other ops
timotimo ([71 71 71 71 167 71 39 63 39 39 71 55 55 39 71 63 71 71 71 71 39 71 71 63 167 23 63 63 39 39 39 71 63 63 63 71 39 39 63 39 23 23 71 71 71 71 71 71 63 71 63 39 79 39 55 55 55 71 71 79 63 39 71 71 63 63 71 39 63 63 39 71 71 71 39 39 71 39 23 79 71 23 23 55 167 55 167 63 55 71 63 71 71 71 63 39 39 71 167 143 ...] 396 31) 13:27
([2 1 1 0 0 2 1 1 1 1 1 1 1 1 1 2 0 0 0 1 1 0 1 0 2 3 2 1 0 2 0 1 4 1] 34 5)
these are the lists, followed by score, followed by random seed: first for all operands with the r/w and lex/reg/literal bits masked out, then for all operands with only their r/w and lex/reg/literal bits 13:28
if we keep those separate, which may be a PITA, we can get a much tighter packing
timotimo on top of the fact that the values themselves are now smaller, so they could be less than 8 bits per entry, but that's also annoying to deal with depending on the API we have to offer 13:39
anyway, the r/w, lit/reg/lex bits are only 3 out of 8, so i can't just store one in 4 bits per entry and the other in 4 bits per entry 14:02
however
if i do a 4/4 split, even though the values aren't entirely meaningful then, i can get the whole thing stored in 240 bytes 14:03
whereas if i use 8 bits per entry with one list for the upper 5 and one for the lower 3 bits i'd spend 430 bytes
either is better than spending the 624
though of course there's diminishing returns at some point, and i might be going into "actually useless" territory by now 14:04
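The 4/4 split above amounts to nibble packing: two 4-bit fields per byte, so a list of N small values costs ceil(N/2) bytes instead of N. A Python stand-in for the C array (names illustrative):

```python
def pack_nibbles(values):
    """Pack values in 0..15 two-per-byte: even index -> low nibble,
    odd index -> high nibble."""
    assert all(0 <= v < 16 for v in values)
    out = bytearray((len(values) + 1) // 2)
    for i, v in enumerate(values):
        out[i >> 1] |= v << (4 * (i & 1))
    return bytes(out)

def get_nibble(packed, i):
    """Read back the i-th 4-bit entry."""
    return (packed[i >> 1] >> (4 * (i & 1))) & 0xF

flags = [3, 7, 1, 0, 15, 2]
packed = pack_nibbles(flags)
print(len(packed), [get_nibble(packed, i) for i in range(len(flags))])
```

The extra shift-and-mask on every read is the API annoyance mentioned above; the win is purely in footprint.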
nine Well if you manage to make it fit in 64 bytes or so, you'd be down to a single cache line :) 14:49
timotimo rather unlikely :D
MFW a sequence that doesn't try to put only parts of the new operand data at the end when no match was found yet is better than sequences that try to find overlaps for the end 15:08
gaaah tantalizingly close to getting the "start" indices for one of the arrays to fit into 8 bits 15:20
271 entries
AlexDaniel github.com/perl6/problem-solving/p...-488716403 15:30
timotimo only 35 of the first 820 have any marks; that's 2 bytes in every op 19:43
m: say (820 - 35) * 2
evalable6 1570
timotimo and i think the marks are only ever needed by the validator anyway 19:44
brrt you can extract them in a separate array? 19:48
so that the rest is more compact
timotimo sure
would also be possible to have an array of spans that'd be quick to scan through
the only two usages of the mark struct member are in validation.c and ext.c 19:52
so that shouldn't be too bad 19:55
moving it away would already be a good thing i'm sure
i mean ... barely 19:56
but, you know
so, how widespread is bitfield support in C compilers that we target?
MasterDuke can we set up *BSDs in travis, appveyor, or circleci? that's the main platform missing right? otherwise i'd say create a PR and see what ci reports 20:21
brrt if a C compiler doesn't support bitfields, it's not a C compiler at all imho 20:22
timotimo in other words, it would be safe to use bitfields in moarvm code? 20:23
mornfall bitfields are fine, what isn't are assumptions about their layout 22:18
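Mornfall's caveat in one runnable illustration, again via ctypes (which follows the host C compiler's bitfield ABI; the field names are made up):

```python
import ctypes

# Bitfields themselves are fine: reading and writing fields by name is
# fully portable. What is NOT portable is the raw byte layout -- the C
# standard leaves bit allocation order implementation-defined, so never
# serialize or memcmp the underlying bytes directly.
class Flags(ctypes.Structure):
    _fields_ = [("rw",    ctypes.c_uint8, 2),
                ("kind",  ctypes.c_uint8, 3),
                ("spare", ctypes.c_uint8, 3)]

f = Flags(rw=1, kind=5)
print(f.rw, f.kind)     # field access: always safe
raw = bytes(f)          # raw layout: depends on the ABI -- don't rely on it
print(len(raw))         # 8 bits of fields fit in one byte
```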