00:02 raiph joined
dalek MoarVM: 4538f61 | jnthn++ | src/ (3 files):
Cache dynlex lookups.

As suggested by TimToady++, we stash them within frames, so we get lifetime management for free (including if continuations happen). We poke it a few frames down the stack at various intervals, to try and maximize the benefit. Can likely tune this a bit more yet.
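The frame-stashing cache from this commit could be sketched roughly like this (a minimal illustration in plain C; the structure and function names are made up for the sketch, not MoarVM's actual code):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical frame with a one-entry dynlex cache; illustrative only. */
typedef struct Frame {
    struct Frame *caller;
    const char   *dynlex_name;   /* dynamic lexical declared here, if any */
    int           dynlex_value;
    /* one-entry cache stashed by a previous lookup */
    const char   *cache_name;
    int          *cache_slot;
} Frame;

static int walks; /* counts how many frames lookups have visited */

int *dynlex_lookup(Frame *f, const char *name) {
    if (f->cache_name && strcmp(f->cache_name, name) == 0)
        return f->cache_slot;              /* cache hit: no stack walk */
    for (Frame *cur = f; cur; cur = cur->caller) {
        walks++;
        if (cur->dynlex_name && strcmp(cur->dynlex_name, name) == 0) {
            f->cache_name = name;          /* stash result in this frame */
            f->cache_slot = &cur->dynlex_value;
            return f->cache_slot;
        }
    }
    return NULL;
}
```

Because the cache lives inside the frame itself, it is freed together with the frame, which is the "lifetime management for free" point in the commit message.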
timotimo nice :) 00:14
jnthn Need to lose another 1.48s before I can say I can build Rakudo in 70s. :) 00:15
timotimo how much is that worth? 00:16
how much did your last commit improve build times? 00:17
00:17 avuserow joined
jnthn Was about another second off the Rakudo build. 00:19
So, at least a %
timotimo sweet!
you said you timed it at about 1.8% recently; so maybe you halved the time spent in dynvar lookups? :) 00:20
jnthn Yeah; I'll need to do a C level profile again at some point. 00:21
Wowza. Attempting to optimize junctions creates 103966 closures when compiling CORE.setting... 00:22
timotimo attempting?
jnthn Well, we may succeed
timotimo how did i do that :(
too many non-inlined blocks?
jnthn non-inlinable 00:24
3 nested subs 00:25
The optimizer (and I'm guilty too) has some very large methods in it.
timotimo ah, those nested subs could be un-nested and just take more arguments
that would help, right?
jnthn Which aren't too maintainer friendly, but aren't exactly spesh-friendly or optimizer friendly either
Well, trying an easier refactor that's probably as effective.
timotimo OK 00:26
jnthn Basically, pull the transform into a separate method from the analysis. 00:27
Yes, that helps a lot. 00:29
Though it's not the biggest source of issues, just the most stand-out one
timotimo how do you measure what part of the process generates how many closures? 00:33
jnthn Patch to takeclosure in frame.c that just prints out the outer frame name. 00:35
timotimo ah, OK 00:37
and a | sort | uniq -c | sort -n
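The measurement technique described above can be sketched in-process too: record the outer frame's name on every closure taken, then tally occurrences (the equivalent of the `| sort | uniq -c | sort -n` pipeline). The event names and table layout here are made up for illustration:

```c
#include <string.h>

#define MAX_NAMES 64
static const char *names[MAX_NAMES];
static int counts[MAX_NAMES];
static int num_names;

/* The hypothetical one-line debug patch: tally by outer frame name. */
void take_closure(const char *outer_frame_name) {
    for (int i = 0; i < num_names; i++)
        if (strcmp(names[i], outer_frame_name) == 0) {
            counts[i]++;
            return;
        }
    if (num_names < MAX_NAMES) {
        names[num_names] = outer_frame_name;
        counts[num_names++] = 1;
    }
}

int count_for(const char *name) {
    for (int i = 0; i < num_names; i++)
        if (strcmp(names[i], name) == 0)
            return counts[i];
    return 0;
}
```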
00:50 colomon joined 01:48 FROGGS_ joined 01:56 cognominal joined 02:01 FROGGS_ joined 03:00 jimmyz joined
jimmyz Stage parse : 36.933, 1.4s lower since yesterday :) 03:00
03:01 tadzik joined, ventica joined, cognominal joined 03:02 avuserow joined
xiaomiao I wonder what the standard deviation of those benchmarks is ;) 03:31
37sec +-1 sec, that's about 3% ... that could be "noise"
03:52 ilbot3 joined 03:56 ventica joined 05:18 avuserow joined 05:35 bcode joined 05:57 avuserow joined
sergot o/ 06:18
07:19 ventica joined 07:24 cognome joined 07:30 cognome_ joined 07:50 ventica joined 08:01 ventica joined 08:13 zakharyas joined 08:18 Ven joined
masak \o 08:39
nwc10 o/
08:47 FROGGS[mobile] joined 08:53 brrt joined 09:06 brrt joined, brrt left
jnthn o/ 09:21
nwc10 OK, so one key part of testing is "don't fill the disk" 09:31
masak heh. 09:32
09:33 colomon joined 09:52 jose__ joined
nwc10 m: say 6.7812e+01/6.8598e+01 10:31
camelia rakudo-moar 89c8e4: OUTPUT«0.988541939998251␤»
nwc10 jnthn: that's the setting build speedup, once the disk is only 90% used
jnthn nwc10: Speedup since when, exactly? :) 10:36
nwc10 er, last time I measure it. Which was probably yesterday morning.
grammar/fingers gah 10:37
I'm going to measure parrot performance again, to see if it gained more 10:38
11:05 carlin joined
dalek MoarVM: b6a9cad | jnthn++ | src/core/frame.c:
Fix an uninitialized variable bug.

12:22 klaas-janstol joined 12:42 oetiker joined
dalek MoarVM: e92aa36 | jnthn++ | src/6model/ (13 files):
De-virtualize most reader functions.

No point to call the same thing every time through a function pointer.
MoarVM: 9a3a96d | jnthn++ | src/6model/serialization.c:
Bump minimum serialization format version.

This in turn enables us to assume we have varints in the thing we are reading, which we have for quite a while now.
MoarVM: 6da5b90 | jnthn++ | src/6model/ (9 files):
De-virtualize read_var_int.

MoarVM: f55e682 | jnthn++ | src/6model/ (14 files):
De-virtualize serialization write functions.

Again, the abstraction was unused and unrequired.
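The de-virtualization in these commits boils down to a simple pattern: when only one implementation ever sits behind a function-pointer table, call it directly so the compiler can inline it. A minimal sketch (illustrative types and names, not the real serialization code):

```c
/* Before: every read goes through a function-pointer table, even though
 * only one implementation exists. */
typedef struct {
    int (*read_int)(const unsigned char *buf);
} ReaderTable;

static int read_int_impl(const unsigned char *buf) {
    /* little-endian 32-bit read, as a stand-in for a reader function */
    return buf[0] | (buf[1] << 8) | (buf[2] << 16) | ((int)buf[3] << 24);
}

static ReaderTable table = { read_int_impl };

/* After de-virtualization: a direct call, no indirect dispatch per read. */
static int read_int_direct(const unsigned char *buf) {
    return read_int_impl(buf);
}
```

The indirect call costs a pointer load and defeats inlining on every read; since the abstraction was "unused and unrequired", the direct form loses nothing.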
nwc10 m: say 6.597e+01/6.7812e+01 13:55
camelia rakudo-moar 4d347f: OUTPUT«0.972836666076801␤»
nwc10 er, so that's 2.5% speedup since this morning
FROGGS[mobile] O.o 14:06
14:15 zakharyas joined
[Coke] does the .msi have rakudo-moar in it? 14:36
so, I have someone who is a cpan module author, who has written XS stuff, has hacked on perl core in the past... and he's too intimidated to use perl 6. 14:37
timotimo to *use* it?
interesting, should be a good "test subject" :)
[Coke] even to get a copy setup to play with. 14:38
timotimo he's on windows, yeah?
froggs had a rakudo star moarvm msi release candidate at one point
[Coke] well, step one, we need a better story on perl6.org.
timotimo nobody tested it, so it disappeared again
btyler perl6 is extremely intimidating at first, because the vast majority of the code you encounter 'casually' is straight from the core rakudo crowd 14:39
and that code tends to be rather dense, in the interest of maximally demonstrating power in minimal space 14:40
[Coke] I think we could borrow some ideas from the mojolicio.us site.
btyler most perl 5 code you might encounter randomly is more or less baby perl
[Coke] btyler: he's not even at code. too many options before that. 14:41
btyler ah, sorry, projected from my own experience too much :)
timotimo that does happen, yeah :( 14:44
i thought about giving perl6.org an "express lane"
[Coke] [Coke] creates a playground to test with... 14:47
timotimo what is this "playground"? :)
[Coke] a fork.
timotimo ah, of course 14:48
[Coke] [Coke] trips over the prereqs. whoops. 14:56
hoelzro [Coke]++ 15:00
[Coke] ... I thought I was in #perl6 this whole time. 15:05
timotimo ah
[Coke] whoops
15:18 ventica joined
dalek Heuristic branch merge: pushed 16 commits to MoarVM/moar-jit by jnthn 15:29
jnthn brrt: Updated moar-jit to master, are confirming it works. :)
timotimo "are confirming"? 15:35
jnthn *after
timotimo ah, excellent!
and even jit-moar-ops is in there
things are looking mighty fine :) 15:36
jnthn Except that things are slower with the JIT enabled... 15:37
timotimo yeah 15:38
probably just spending too much time aborting frames, still?
jnthn Not sure yet 15:39
Seeing if I can discover anything.
timotimo have you counted how often the jit-invocation opcode got hit?
dalek MoarVM/moar-jit: 3f22397 | jnthn++ | Configure.pl:
Make dynasm rule work on nmake.

MoarVM/moar-jit: bafbc3b | jnthn++ | src/jit/emit_win32_x64.c:
Win32 JIT output was behind.

jnthn Oddly, my profiler claims that we spend 6% of the time in JITted code, but the time spent in the interpreter only goes down by 1% 16:00
timotimo oh, huh? 16:02
but the jitted code ought to be at least a bit faster, right? 16:03
hm, except
if gcc strongly optimizes the interpreter loop, maybe it handles moving stuff from register to register directly instead of going through our locals storage?
i don't quite see how that would be doable without "unrolling" the interpreter loop, though 16:04
16:04 ventica joined
lizmat btyler / [Coke] : TheDamian gave a nice example of how he ported a perl 5 utility of his to perl 6 16:07
at OSCON, wonder where that code lives nowadays
japhb lizmat: Is there a video of that? 16:25
timotimo i want to know, too
lizmat yes, check out OSCON videos :-)
japhb 2014?
timotimo jnthn: does the jit dump the generated bytecode to files, perhaps? 16:29
Got negative offset for dynamic label 6 - i wonder where that comes from? 16:30
jnthn Not by default, afaict 16:31
timotimo even with a jit log i get 32.407 for stage parse 16:32
that's not too bad, is it?
jnthn If you set MVM_JIT_DISABLE=1 here, it comes out slower than with JIT, though. 16:33
uh, faster than with JIT
timotimo hold on. 16:34
only about 0.3 seconds 16:35
hm. maybe 0.5 16:36
16:39 cognome joined
timotimo 788 frames compiled 16:39
japhb I'm not sure we can expect the JIT to be loads faster than spesh until we move from "the easy way that works" to "optimizing all the cycles". A JIT is an expensive thing, and you have to win it back with seriously tuned output.
timotimo 882 bails
japhb Especially while the execution flow has to bounce in and out of JIT land
timotimo sp_findmeth is still the king
with 271 16:40
(probably because of much improved bytecode? maybe we have less frames all-in-all now?)
japhb Getting it working with just neutral performance v. spesh is already a good thing, because it would mean the generated code is enough faster to make up for the cost of generating it.
timotimo yes 16:41
japhb oh, timotimo: did you look at the flame chart info I sent you in #perl6 earlier? 16:42
timotimo yes, pretty!
japhb Man, I want that for my Perl 6 code ....
timotimo well, with the "perf" line from that one blog post you can already get that for the c-level stuff 16:43
16:45 cognominal joined, cognome joined
jnthn Thing is that it's hard to explain it as "JIT takes time", when my profiler is telling me 0.1% of the time is spent doing that. 16:46
timotimo hm. how does that measure time spent in c functions called from the jit? 16:47
oh, that number is for "jitting frames"
jnthn Yes 16:48
I'm just wondering if it's because CORE.setting's deopt count is epic.
timotimo how come we have "loadlib" ops in "name", "type", "box_target", "positional_delegate" and "associative_delegate"?
jnthn And falling back out of the JIT when deopting is more expensive than a switch-code-in--interpreter deopt. 16:49
timotimo and has_accessor?
jnthn timotimo: um...not sure I follow?
timotimo in the jit bail log i see a bunch of failures with the loadlib opcode
i ... don't think i understand what it does
ah, that op would expect to hit the cache a bunch of times 16:50
i hope the lock contention isn't too bad on that when we get to multithreaded apps. but i don't even know under what circumstances loadlib opcodes are generated 16:51
jnthn loadlib is hot?
timotimo don't think it is
jnthn That'd be...odd
timotimo just 9 bails
ah, loadlib is probably just used to get a handle to a library and then findsym would be used to get at whatever symbols it'd expose 16:52
that sounds like something that could spesh well.
jnthn What are you seeing loadlib in? 16:54
timotimo jit bail log
jnthn For?
timotimo the core setting 16:55
don't let me distract you, it's probably nothing
oh, that could be the methods of the Perl6::Compiler 16:58
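The loadlib/findsym pattern discussed above works much like dlopen/dlsym: get a handle for a library name (cached, so repeated loads of the same name are cheap), then resolve symbols from the handle. A sketch of just the handle-caching part, with made-up types standing in for MoarVM's:

```c
#include <string.h>

typedef struct { const char *name; } LibHandle;

#define MAX_LIBS 16
static LibHandle cache[MAX_LIBS];
static int num_libs;
static int real_loads;   /* how often we paid for an actual load */

LibHandle *loadlib(const char *name) {
    for (int i = 0; i < num_libs; i++)
        if (strcmp(cache[i].name, name) == 0)
            return &cache[i];      /* cache hit: no reload */
    real_loads++;                  /* a real VM would dlopen() here */
    cache[num_libs].name = name;
    return &cache[num_libs++];
}
```

This is why the op "would expect to hit the cache a bunch of times": after the first load, every later loadlib of the same name is a cheap lookup.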
jnthn timotimo: Did you do some work on reducing guards at some point? 17:03
origin/split_get_use_facts <- was that pending review? 17:04
17:08 FROGGS joined
FROGGS o/ 17:08
jnthn o/ FROGGS
TimToady \o 17:14
carlin 17:15
17:32 colomon joined
dalek MoarVM: 9d377a3 | (Timo Paulssen)++ | src/ (3 files):
split get_facts and use_facts from get_and_use_facts.

MoarVM: be8cfdf | (Timo Paulssen)++ | src/spesh/optimize.h:
fix teh build

MoarVM: b57061e | jnthn++ | src/spesh/osr.c:
Ensure OSR-triggered optimize is used next invoke.

MoarVM: 8df127a | jnthn++ | src/ (3 files):
Merge remote-tracking branch 'origin/split_get_use_facts'

MoarVM: 49f19ca | jnthn++ | src/spesh/log.h:
Tweak spesh log run count.

Bump minimum bytecode version to 2.
jnthn timotimo: merged the branch, thanks :) 17:46
timotimo oh, that 18:53
nice :)
nwc10 Result: PASS 19:01
jnthn Nice. Time to break more stuff :P 19:14
nwc10 other people could just write more tests 19:15
timotimo jnthn: about the loadlib thing i said earlier: there's a bunch of frames that look exactly like this: gist.github.com/timo/9e49a3806f02857a484f
jnthn What on earth... 19:16
[Coke] do we have a pic of some kind somewhere to show the flow of a program through rakudo when it's on Moar? (esp. with the new spesh/jit stuff?) 19:17
timotimo my thoughts exactly.
jnthn No. If you're lucky I might draw one for my YAPC::EU talk though :) 19:18
[Coke] jnthn: perfect, that'd be fine! 19:20
DAMMIT, it's in Sofia!?
I have free beer waiting for me in Sofia! 19:21
... I cannot remember the name of the guy who owes me the beer. *sadface*. it's been too long.
timotimo jnthn: what's keeping us from closing the loop on the "put argument names into callsites" optimization? 19:22
jnthn timotimo: Not much; it's just fiddly and annoying to do and will have a fairly low ROI 19:23
timotimo OK then 19:24
timotimo pushes it further to the back :P 19:25
jnthn: would you be interested to sketch out ideas for how to turn spesh into a profiling thingie in the future? 19:27
19:29 ventica joined
carlin [Coke]: ahh, so that's why rakudo 2014.07 is codenamed Sofia 19:29
19:32 FROGGS joined
nwc10 m: say 6.636e+01/6.597e+01 19:35
camelia rakudo-moar 085ab9: OUTPUT«1.00591177808095␤»
nwc10 slight negative speedup since lunchtime. 19:36
jnthn Hmm
Wonder what's to thank for that...
nwc10 but, given I've had repeatable speed differences depending on the order that object files are linked
there is some level of insanity in performance metrics
MoarVM: 0043778 | jnthn++ | src/ (3 files):
Split out part of frame deserialization.

The split out part will be able to happen lazily, the first time we need it. (At present that won't be much of a win as we touch many of the frames at startup to install static lexical information; the plan is to move this information into the bytecode file also).
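The lazy-deserialization idea in this commit follows a standard pattern: keep only the raw serialized bytes per frame at load time, and decode them the first time the frame is actually touched. A minimal sketch (illustrative fields, not MoarVM's real frame layout):

```c
typedef struct {
    const unsigned char *raw;   /* serialized form, always present */
    int decoded;                /* has the expensive part been done? */
    int num_lexicals;           /* example of a lazily decoded field */
} StaticFrame;

static int decodes;             /* counts how many frames paid the cost */

int frame_num_lexicals(StaticFrame *f) {
    if (!f->decoded) {          /* first touch pays the decode cost... */
        decodes++;
        f->num_lexicals = f->raw[0];   /* stand-in for real decoding */
        f->decoded = 1;
    }
    return f->num_lexicals;     /* ...later touches are free */
}
```

Frames never touched never decode at all, which is where the memory win for large bytecode files like CORE.setting comes from.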
timotimo nwc10: maybe we should start putting -flto into our gcc commandlines? 20:02
jnthn timotimo: How much difference does it make? 20:03
nwc10 I have no good idea about that
timotimo haven't measured yet
jnthn: that commit above combined with the plan you mention in it ... would that make a difference for memory usage? 20:05
like, not using 99% of the frames in core setting would free up a bit of memory?
jnthn timotimo: That's the hope, yes 20:07
timotimo: And maybe a bit of a startup saving too
timotimo i'd like that a whole lot
dalek MoarVM: 0098c0c | jnthn++ | src/ (5 files):
Preparations for lazy frame deserialization.

MoarVM: cdda218 | jnthn++ | src/core/bytecode.c:
Switch on lazy frame deserialization.

Or at least, the parts we can easily get away with putting off until later. While it needs further work to take further advantage, NQP shows a 2.2% and Rakudo shows a 1.4% memory reduction for the empty loop program.
timotimo 1.4% would be about 2 megabytes? 20:54
jnthn Yeah, just short of 20:55
21:07 zakharyas joined 21:23 btyler joined
dalek MoarVM: c65b2a6 | jnthn++ | docs/bytecode.markdown:
Spec static lexical values table in bytecode.

MoarVM: 9ba5d15 | jnthn++ | src/mast/compiler.c:
No longer need to support Parrot cross-compiler.

It's almost certainly broken beyond repair to cross-compile from Parrot to Moar anyway, so no need to keep these last bits around.
MoarVM: ac33547 | jnthn++ | lib/MAST/Nodes.nqp:
Update MAST::Frame to hold static lex values.

MoarVM: c0984eb | jnthn++ | src/ (4 files):
Write static lex values; read but don't apply them

MoarVM: e64c5eb | jnthn++ | src/core/bytecode.c:
Read in static lexicals.

MoarVM: f25affb | jnthn++ | src/mast/nodes_moar.h:
MAST nodes can be identified by exact type.

timotimo oh, that ought to help a lot 23:30
we do istype on mast nodes all the time 23:31
oh, that's only for inside the mastcompiler
but it should still help
jnthn It's a small improvement...the cache-only istype is quite cheap anyway
timotimo #define EMPTY_STRING(vm) (MVM_string_ascii_decode_nt(tc, tc->instance->VMString, "")) 23:32
we have a per-tc (or per vm?) empty string nowadays
jnthn per vm
where on earth do we use that macro..
oh, once per compilation
no big saving
timotimo ./src/mast/compiler.c: hll_str_idx = get_string_heap_index(vm, ws, EMPTY_STRING(vm));
jnthn but yeah, feel free to tweak it
timotimo oke 23:33
dalek MoarVM: ff15814 | (Timo Paulssen)++ | src/mast/nodes_moar.h:
we can use the vm's empty string constant here.

timotimo should i perhaps teach the ascii encoding about zero-length strings, re-routing them to the global empty string constant if it exists? 23:39
jnthn I think they are widely interned... 23:41
And utf8 would be a better one to teach it
timotimo mhm 23:42
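The interning idea being discussed, in sketch form: a decoder that hands back one shared empty-string constant for zero-length input instead of allocating a fresh object each time (plain C stand-in; the real VMString handling differs):

```c
#include <stdlib.h>
#include <string.h>

static const char EMPTY[1] = "";

const char *decode_ascii(const char *bytes, size_t len) {
    if (len == 0)
        return EMPTY;           /* shared constant, no allocation */
    char *s = malloc(len + 1);
    memcpy(s, bytes, len);
    s[len] = '\0';
    return s;
}
```

The payoff is pointer-identity for all empty strings and one fewer allocation per decode; as jnthn notes, doing the same for utf8 would matter more in practice.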
jnthn Fun fact: somewhere in Grammar.pm is a frame with 612 labels 23:43
timotimo oh, cute
is that after inlining?
jnthn No!
timotimo oh wow!
jnthn Well, aside from NQP's block flattening of course. 23:44
timotimo seems pretty jumpy
jnthn Yeah
Well, I'm pondering some MAST::Label changes.
Today, we always make a string name for a MAST::Label, passing it to its constructor
timotimo could be integers, too, right? 23:45
jnthn However, we never - afaik - in the compiler make two MAST::Labels with the same identifier
Well, they could be integers, yes.
The alternative is that they just work by object identity
Which I believe would work with the current codebase.
Saving 8 bytes per MAST::Label
timotimo hey, with jit enabled and latest master i get 30.5 seconds stage parse on my laptop :3
oh, even better 23:46
jnthn But I was then thinking "hm, I have no hash key"
And wondering what happens if I make a linear scan of the labels.
It'll be a C array so not *too* bad.
timotimo even if you have a frame with 612 labels?
jnthn A hash may be O(1) but the constant overhead isn't automatically cheap. 23:47
Well, that's an extreme/rare case.
timotimo that's right
jnthn Most frames are tiny.
We might lose out on the odd extreme one.
timotimo so at least the linear search is going to be limited to each frame individually
jnthn Right.
timotimo that does sound sensible; do you have a histogram of frame sizes or something?
jnthn No
I just looked for maximum ones
But I'm quite used to reading spesh logs :)
And labels <=> basic blocks are close 23:48
timotimo ah, yes
i've not seen any with 4 digit BBs in core setting :)
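The label-by-identity plus linear-scan idea above can be sketched like this: drop the string name entirely, compare object pointers, and scan the frame's (usually tiny) label array. Structures here are illustrative, not the actual MAST compiler code:

```c
typedef struct { int bytecode_offset; } Label;

#define MAX_LABELS 612   /* the extreme Grammar.pm case; most frames are tiny */

typedef struct {
    Label *labels[MAX_LABELS];
    int    num_labels;
} FrameLabels;

int add_label(FrameLabels *fl, Label *l) {
    fl->labels[fl->num_labels] = l;
    return fl->num_labels++;
}

int label_index(FrameLabels *fl, Label *l) {
    for (int i = 0; i < fl->num_labels; i++)
        if (fl->labels[i] == l)   /* identity compare, no string hashing */
            return i;
    return -1;
}
```

Since the scan is per-frame, its cost is bounded by that frame's label count, which is why only the rare 600-label outliers could lose out against a hash.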
jnthn I wonder how many labels we create in compilation...
m: say 21160 * 8 23:52
camelia rakudo-moar fb0521: OUTPUT«169280␤»
jnthn That's how much we'd save on MAST::Label directly
But we save all the strings too
m: say 21160 * (6 * 8 #`(string size) + 10 #`(conservative label length estimate) * 4 #`(per grapheme)) 23:54
camelia rakudo-moar fb0521: OUTPUT«1862080␤»
jnthn Not so much I guess. 23:55
Though there's at least 1 intermediate string too, which is the numification of the number stuck onto it.
Well, may give it a go tomorrow to see how it helps 23:56