Boosting performance of Map - HashMap, ST, or FFI

chexxor · December 14, 2018, 2:32pm

gregberns [10:39 PM]
hey I’m doing a PS POC to prove out performance compared to C#/raw JS, and having some pretty significant performance issues with Map, specifically when trying to insert 1 million times into a Map. I’m assuming the creation of new arrays each time is creating the issue. Are there any optimized libs that are faster? ST performance is OK, and I might be able to do some FFI work but not excited about it.
Any thoughts on other structures…. really anything to get performance times up??

John Mendonça [10:42 PM]
I know there is https://github.com/purescript/purescript-foreign-object but i dont know about its performance, etc

gregberns [10:43 PM]
I was looking at that and I’ll go that direction if I need to, but it seems like some of my type support disappears.

monoidmusician [10:46 PM]
it has the same amount of type safety, it’s just a different implementation of Map String, essentially
it has a nice mutable API through ST, and obviously quick lookups through the native JS operation, but it scales worse at other operations (immutable insertion is O(n) I believe, and there’s no sharing done – but I suspect you don’t care about that)

natefaubion [10:50 PM]
Are you just doing a benchmark, or do you really need a million items in a Map?

gregberns [10:51 PM]
hmmm… I can transform my record to Array (Tuple String Record) which gives me my keys and records. Then convert that to the Object. I’m kinda hesitant (probably cause I havent worked with it) just because it might look challenging to others coming in
@natefaubion right now its a POC doing benchmarking. but the organization is totally new to PS and so were a little hesitant to need to do anything too drastic or difficult to understand to get performance gains. After 2 days of perf work, were at 2 minutes compared to 7-10 seconds with C# and it looks like its all about insertion into Map

natefaubion [10:54 PM]
I don’t think you can compare JS objects with C# let alone PS
They are in totally different leagues
If your usage is predicated on insertion of millions of items into a Map, you are using the wrong platform

gregberns [10:55 PM]
haha well… yea I get that, but… with raw JS 1M insertions into an object vs a C# dict is about the same

natefaubion [10:56 PM]
I would use ST then
which is going to be roughly the same as the JS version that mutates an object
But make sure you use unsafeFreeze

gregberns [10:56 PM]
We used ST and got some gains, but its definitely still not there

natefaubion [10:56 PM]
Then use JS. That’s why the FFI exists
You should not expect PS to be as fast as raw JS
Map is an immutable tree data structure, so GC churn for a million items is going to be insane

gregberns [10:57 PM]
please dont get me wrong, I do get that, just struggling to figure out a better solution.
haha yea!

natefaubion [10:58 PM]
How are you using ST?
vs the JS solution?
I think if you are ever in the situation where you need to insert a million items as fast as possible, you should use JS. I would be very surprised if that is the case, and in that case I would question why you want to do it in PS with a JS backend

gregberns [11:00 PM]
we’ve been looking at using the FFI and I think I have something decent for array, but once the insertions are complete I’d really like to return a Map instead of our own object

natefaubion [11:00 PM]
There is no way you can build a Map with a million items in reasonable time

gregberns [11:01 PM]
We are totally open to Haskell, but… record syntax.

natefaubion [11:01 PM]
For a million items there is probably a better data structure than PS Map
You could try the HashMap implementation

gregberns [11:02 PM]
For the POC there are two parts: first build an initial Map, which is just insertions based on Key. The second part is to do a lookup, modify the object, and insert it back in.

natefaubion [11:02 PM]
But anything immutable is going to be rough
for initial construction
If you want fast initial construction you absolutely need a data structure that supports an ST API

gregberns [11:04 PM]
I didnt know about HashMap. Any idea if its lookup perf will be slower?

natefaubion [11:04 PM]
Map does not support that. Building a Map from an Array of Tuples is no different than just calling a bunch of inserts

gregberns [11:04 PM]
(I will definitely try it)

natefaubion [11:05 PM]

gregberns [11:05 PM]

Building a Map from an Array of Tuples is no different than just calling a bunch of inserts

Yea that’s what I realized after looking at the source code

natefaubion [11:06 PM]
However if they are already ordered, I would just use binary search on the array

I think someone wrote an API for that
https://pursuit.purescript.org/packages/purescript-sorted-arrays/0.2.0

gregberns [11:07 PM]
Thats an interesting idea as well
Here’s the ST test we wrote (note: I knew nothing about ST 2 days ago)

test :: Effect Unit
test = do
  let 
    inner :: forall r. Array Int -> ST r (Object Int)
    inner arr = do
      
      map' <- new :: ST r (STObject r Int)
      
      _ <- foreach arr (\i -> poke (show i) (i * 10) map' *> pure unit)

      freezeST map'

  let arr = range 1 1000000
  log "start"
  t0 <- now
  let a = run (inner arr)
  t1 <- now
  log $ "End " <> show (t1 - t0)
  log ""

natefaubion [11:10 PM]
You should use unsafeFreeze
freezeST will copy the object
which will just build it all over again

gregberns [11:10 PM]
I was wondering about that…

natefaubion [11:11 PM]
you should also use do syntax instead of *> if you are really serious about eking out performance

natefaubion [11:11 PM]
since in 0.12.1 it’s optimized to statements

gregberns [11:11 PM]
… what do I need to change to do that??

natefaubion [11:11 PM]
foreach

foreach arr \i -> do
  _ <- poke ...
  pure unit

magic-do does not optimize applicative operators
it probably should

gregberns [11:13 PM]
ah alright I’ll give that a try too

natefaubion [11:14 PM]
I would expect that to still be a good bit slower though, because it will still allocate thunks for the effects
Unfortunately there’s no EffectFn analog for ST

gregberns [11:16 PM]
I think that test took something like 7 seconds vs another similar example that was at 14 sec (w/o ST), so we did cut it in half. I’ll re-run the test with those two changes.

Unfortunately there’s no EffectFn analog for ST

yea that’s what we were wondering, if there could be a way to do all the mutations in an Effect.
Oh here was another question I had. If we do just use ST, is there any way to get away from using STObject/ get away from returning a generic Object and having the rest of the system work on that?

natefaubion [11:22 PM]
ST is just an FFI type, you can do anything in ST if you write FFI to do it

gregberns [11:22 PM]
ah, ok that makes sense

natefaubion [11:23 PM]
That HashMap implementation supposedly has an efficient implementation of fromArray
that uses a mutable API
I can’t speak for the library though
its backed by a lot of JS so hard to verify

gregberns [11:29 PM]
gottcha, alright I’ll take a look at that instead. I cant think of any reason to dislike HashMap at this point.
Thanks for all your help. We’ve been working the last 2 weeks in PS and its the deepest dive into a Haskell-like lang and I am loving it and am about willing to do anything to stick with it. Love the “it compiles, it works”… coming from JS and C#, my life is just better

natefaubion [11:31 PM]
I would encourage you to stick with it, but be aware that is a relatively naive language, runtime-wise (at least the primary JS backend)
It gives you a few tools, but it is not really optimized at all

gregberns [11:36 PM]
Were doing all we can to prove it out. The CIO is concerned with the performance of Node vs C#, and I think we can decently prove the Node/C# is generally neglagible. We just need to mitigate any serious concerns about PS’s performance. I dont think we need to be equal, but I think were 20 times slower now. We’re just trying to get that down as much as possible… and not use too many ugly tricks (ST).
(ugly tricks: hard for noobs to understand)

gregberns [12:12 AM]
@natefaubion the ST changes you suggested halved the time.
– 727.0 w/ unsafeFreeze and do
– 984.0 w/ freeze and do
– 1574.0 w/ freeze and NOT do
– 1298.0 w/ unsafeFreeze and NOT do
The HashMap impl is worse, but the implementation ‘looks’ very clean and readable
– 2936.0 w/ HashMap

gregberns [12:26 AM]
HashMap just 3x it with our process (1m entries)
Insertion with Map: 10138.0
Insertions with HashMap: 2857.0

GordonBGood · March 21, 2019, 6:02am

@gregberns, As you seem to be doing, one of the first things I do with a new language is test it for speed in different ways. The following benchmark that does a Sieve of Eratosthenes on about a CPU L1 cache sized buffer of 16 Kilobytes with “bit twiddling” bit packing is about as fast as it can be written in PureScript; it is about four times slower than using Fable (an F# to JavaScript transpiler just as PureScript does from it’s own language) for the following reasons:

Poor optimization of monadic code as in using function calls to implement “do” notation…
Output of ES3 code so that new function nesting needs to be used in order to obtain new “var” scopes (ES3 only has function scoping)…
Many extra function calls required to implement currying of the peek/poke functions…
Thus the inner loop is slow due to multiple function calls per loop, only some of which get optimized away by the JavaScript JIT compiler…

It is about four times slower than Fable, which doesn’t normally use monadic code to mutate the contents of arrays and optimizes the use of currying to reduce function calls as well as uses ES2015 with “let/const” variable definitions to avoid the extra levels of function calls for scoping in loops, but which is still a few tens of percent slower than pure JavaScript due to not entirely optimizing the output JavaScript code. As well, Fable can take full advantage of the newer JavaScript primitive integer arrays for further gains in speed using a tweaked algorithm due to supporting the “Byte” type where PureScript only supports the “Int” type in normal List/Arrays as per ES3. For these reasons, if you need the maximum in speed for games or high level math but don’t want to code in JavaScript, Fable is likely a better choice than PureScript.

Benchmark code:

module Main where

import Prelude
import Effect (Effect)
import Effect.Console (log)
import Effect.Now (now)
import Data.DateTime.Instant (unInstant)
import Data.Time.Duration (Milliseconds(..))

import Data.Int.Bits (shl, shr, (.&.), (.|.))
import Control.Monad (pure)
import Control.Monad.ST (ST, run)
import Partial.Unsafe (unsafePartial)
import Data.Array (unsafeIndex, replicate)
import Data.Array.ST (STArray, unsafeThaw, unsafeFreeze)
import Data.Array.ST.Partial
import Data.List.Lazy
import Data.Maybe

test :: Unit -> List Int
test unit =
  let
    cmpsts :: forall s. ST s (STArray s Int)
    cmpsts = unsafeThaw $ replicate 4096 0
    testr i = run ( do
      cs <- cmpsts
      v <- unsafePartial $ peek (i `shr` 5) cs 
      pure $ v .&. (1 `shl` (i .&. 31)) /= 0 )
    loop x =
      if x <= 0 then unit else
      let
        loopi i =
          if i > 254 then unit else
          if testr i then loopi (i + 1) else
          let
            p = i + i + 3
            s = (i `shl` 1) * (i + 3) + 3
            cull c =
              if c >= 131072 then unit else
              let
                w = c `shr` 5
                cullit = run ( do
                  cs <- cmpsts
                  v <- unsafePartial $ peek w cs 
                  _ <- unsafePartial $ poke w (v .|. (1 `shl` (c .&. 31))) cs
                  pure unit )
              in cull (c + p)
            docull = cull s
          in loopi (i + 1)
        doloop = loopi 0
      in loop (x - 1)
    doit = loop 1000
    composites = run ( do
      cs <- cmpsts
      unsafeFreeze cs )
    test i =
      let v = unsafePartial $ unsafeIndex composites (i `shr` 5)
      in v .&. (1 `shl` (i .&. 31)) == 0
  in 2 : mapMaybe (\i -> if test i then Just (i + i + 3) else Nothing) (0..131071)

main :: Effect Unit
main = do
  start <- now
  let answr = test unit
  stop <- now
  let
    (Milliseconds startn) = unInstant start
    (Milliseconds stopn) = unInstant stop
    elapsed = stopn - startn
  log $ "Found:  " <> (show $ length answr) <> " primes up to 242146."
  log $ "Took " <> show elapsed <> " milliseconds."

The above code outputs the correct number of primes over this range as 23000, the culling is repeated 1000 times so that the total of approximately (just less than) 200 million inner cull loops take a reasonable amount of time for timing, which is about 5.3 seconds on my Windows tablet (Intel i5-Z8350 @ 1.92 Gigahertz).

The above code is faster at array manipulations than even your new tuned version using the hashing implied by using String indices on an unspecified CPU due to the cost of the hashing, but at least it shows what the minimum loop time can be on this slowish CPU, as you likely are using a faster one than this - your fastest times for inserting a million elements are taking about a over a thousand nanoseconds per element insertion whereas this is taking about 26.5 nanoseconds per array modification or about 50 times faster showing that loop overhead isn’t likely a problem:

Due to the lack of “do” optimizations, one can’t write this code as elegantly as for Haskell as the “do” interferes with the ability to do Tail Call Optimizations (TCO’s) and we get stack overflows if the inner culling loop is inside the “do” (an intermediate function call prevents TCO), which also adds to the number of function calls per loop. PureScript isn’t very sophisticated at code optimizations…

As to even your fastest Object insertions, I notice that you are creating the String “key” values using “show” inside your timed code and have measured that about half of your fasted expended execution time is using in converting integers to generate these String “Key’s”. I don’t think that would be a normal use case?

My conclusion is that for this purpose of inserting String “keyed” values into an Object, by using the FFI interface as here it shouldn’t be significantly slower than by direct use of JavaScript or much slower tha any other language as most of the time expended is used doing hashing of the String “keys” in the JavaScript run time, which is necessary inn any case.

However, for tight loops as are typical for games or for high order math array manipulatioms, PureScript is currently considerably slower than JavaScript or Fable generated JavaScript due to the extra levels of function calls and, although generating “safe” code, may not be the best choice for high performance purposes.