Using multiple cores for routing

Hello,

We only use Crossbar’s routing features. In our typical running system, we have roughly 10,000 long-lived WAMP connections to Crossbar, and some smaller number of short, one-off connections. Those connections each register as their own entity. After that, all of the traffic is routing messages amongst those registered connections.

We currently only take advantage of a single core per instance. I’d like to be able to utilize additional cores on our instances. What is the best way for us to utilize more cores in our use case (i.e. mostly routing messages and occasionally establishing lots of new connections after restarting another service)?

I’ve read about the Proxy Workers, but it sounds like those won’t be helpful in routing all of the messages. Is that correct?

I’ve also read about r-links, but it sounds like those are not production-ready yet. Is that correct?

Thank you,
Ryan

Hi Ryan,

I’d like to be able to utilize additional cores on our instances. What is the best way for us to utilize more cores in our use case (i.e. mostly routing messages and occasionally establishing lots of new connections after restarting another service)?

Proxy workers are able to off-load (scale out to multiple cores/boxes) all of these portions of the WAMP routing workload:

  • accepting and authenticating incoming WAMP connections from clients
  • supporting multiple WAMP transports, including WebSocket and RawSocket
  • supporting all WAMP serializers, including JSON, CBOR, MessagePack and Flatbuffers
  • terminating TLS and off-loading encryption
  • serving Web services, such as static files, redirects, HTML templates and so on

We’ve recently added some docs for these new features

https://crossbario.com/docs/crossbarfx/scaling/introduction.html#web-and-router-clusters

So e.g. if your clients connect over WAMP-WebSocket-JSON-TLS, you will already be able to scale substantially using proxy workers alone.
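
For illustration, a client in that scenario could look roughly like the following sketch (Autobahn|Python, asyncio flavour; the URL, realm and procedure URI are made-up placeholders, not from a real setup):

# rough sketch of a WAMP-WebSocket-JSON-TLS client; placeholder names throughout
from autobahn.asyncio.wamp import ApplicationSession, ApplicationRunner
from autobahn.wamp.serializer import JsonSerializer

class MyClient(ApplicationSession):
    async def onJoin(self, details):
        # each connection registers as its own entity
        await self.register(lambda msg: msg, "com.example.node1.handle")

if __name__ == "__main__":
    # wss:// + JSON: the TLS termination, WebSocket framing and JSON parsing
    # for this connection are exactly the work a proxy worker can take over
    runner = ApplicationRunner("wss://crossbar.example.com/ws", "realm1",
                               serializers=[JsonSerializer()])
    runner.run(MyClient)

All of the TLS, WebSocket and JSON work for such a connection can then happen in the proxy workers, leaving the router worker to do the actual routing.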

Here is an example from a test setup we measured for a customer:

This test ran load clients on one machine, where the clients connect over RawSocket-TCP (no TLS, CBOR serialization) and call a WAMP procedure with 256 random bytes as the (single positional) argument.

The backend procedure being called runs on the testee machine and simply returns the 256 random bytes provided as the call argument.
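
Conceptually, the backend and the load clients were like the following sketch (not the actual benchmark code; shown with Autobahn|Python asyncio over a WebSocket URL rather than the RawSocket-CBOR transport the test really used, and with made-up URL/realm/procedure names):

# rough sketch of the echo backend and a full-load caller; placeholder names
import os
from autobahn.asyncio.wamp import ApplicationSession, ApplicationRunner

class EchoBackend(ApplicationSession):
    async def onJoin(self, details):
        # simply return the 256 random bytes received as the single positional argument
        await self.register(lambda payload: payload, "com.example.echo")

class LoadClient(ApplicationSession):
    async def onJoin(self, details):
        while True:
            # call the backend with 256 random bytes and check the echo
            payload = os.urandom(256)
            result = await self.call("com.example.echo", payload)
            assert result == payload

if __name__ == "__main__":
    # start the backend; the load clients run the same way with LoadClient
    ApplicationRunner("ws://testee.example.com:8080/ws", "realm1").run(EchoBackend)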

  • CrossbarFX configuration: 8 router workers, 16 proxy workers (a 2:1 proxy-to-router worker ratio)
  • 128 client connections were used (#routers X #proxies => 8 X 16 = 128)

Test results:

  • more than 150,000 WAMP calls/sec (@ 256 bytes/call) were performed by the load clients
  • traffic ran over a real network (AWS internal) with almost 1 Gb/s of WAMP client (up+down) traffic
  • CrossbarFX consumed 12 CPU cores and 6 GB RAM
  • the test ran continuously at full load for more than an hour with zero errors
  • memory consumption remained constant and the testee machine stayed stable

Besides proxy workers, the other element of clustering is indeed “router-to-router” links (aka “rlinks”).

These are used to scale the actual core routing for a single realm beyond a single worker process. This is not yet officially announced, but we’ve recently made RPC work fully over rlinks (in addition to PubSub).

Both proxy workers and rlinks are part of Crossbar.io OSS and fully usable “as is”, that is with manually written node configuration files. Crossbar.io FX is adding a master node that can automatically orchestrate, manage and monitor setups with many nodes / workers.

Actually, we will publicly announce this stuff very soon and also publish the complete source code for Crossbar.io FX (on GitHub).

Cheers,
/Tobias

Thanks, @oberstet! It sounds like Proxy workers make the most sense to try first. Just to confirm, even if the majority of our traffic through Crossbar is only routing messages for long-lived connections, you think Proxy Workers still should be able to handle a decent amount of the load?

For a two core machine, would you recommend one proxy worker and one router? Or two proxy workers in addition to the router?

Thanks,
Ryan

Just to confirm, even if the majority of our traffic through Crossbar is only routing messages for long-lived connections, you think Proxy Workers still should be able to handle a decent amount of the load?

Yep, exactly! Compared to the WAMP routing decision logic and message handling, other parts like WebSocket processing, JSON parsing and TLS encryption consume far more CPU cycles.

For a two core machine, would you recommend one proxy worker and one router? Or two proxy workers in addition to the router?

I would try both 2 proxy + 1 router and 4 proxy + 1 router workers - it will depend on the specific mix of connections.

Thanks, @oberstet! I’ll give it a shot and follow up with how it goes.

Testing proxy workers now, seems to work! Some thoughts:

  1. It’d be nice to specify a number of replicas for a proxy worker. I want my workers to have the exact same config, just scaled to N instances. Currently, it seems like I have to just copy+paste the same proxy config N times.

Example config:

{
  "type": "proxy",
  "replicas": 2
}
  2. I’m getting intermittent wamp.error.authentication_failed errors after switching to proxy workers. Not sure why; I didn’t change the authentication config.

Checking out rlinks now; hopefully they’re easy to set up, though the docs are a bit technical! I’d appreciate a guide on how to set up rlinks on Kubernetes.

Really looking forward to the announcement.

Thanks for the testing feedback! We need to hash out the remaining issues before the “big release”, and that includes authentication/authorization (supporting all methods with proxy and router worker clusters).

Rgd the “replicas” idea: in fact, we have something even better that also works with multiple nodes.

This command

crossbarfx shell --realm default add webcluster-node cluster1 all \
    --config '{"parallel": 8}'

will create, start and keep running a total of “number of nodes X 8” proxy workers on cluster1. E.g. with 4 nodes paired to your management realm, the above command will fire up and manage 32 proxy worker processes.

https://crossbario.com/docs/crossbarfx/scaling/examples.html#parallel-web-clusters

Rgd simpler, more practical docs: absolutely. This is another area we’re working on (before big release).

What we already have is a super convenient/powerful Terraform module.

You can use it to create a complete AWS-based Crossbar.io cloud setup.

Rgd K8: what exactly would you like to see? Ideas/suggestions?

Personally, I can see these building blocks (assuming K8):

A. For each K8 (application) pod, add a Crossbar.io node as a side-car container that only runs router workers that listen locally (within the pod) but maintain rlinks to all other side-car nodes (that is, the side-cars are fully meshed)
B. Have a K8 pod with Crossbar.io proxy workers that run public listening endpoints, and keep backend connections to the router workers
C. Have the master node separated in another K8 pod

Incoming client connections would need to be directed (at the TCP/IP level) by K8 to the proxy workers in pod B. The proxy workers would then direct them (at the WAMP level) to the router workers running as side-cars (A). The correct setup of proxy/router workers and routes/connections/rlinks is managed by the master node.

@oberstet, thanks again for the feedback in August. I played around with the proxy workers back then, but was seeing some higher CPU usage. I finally got around to putting together a more robust test, but I’m still seeing much higher CPU usage with the proxy workers than if I just have a single router. My test was set up/run as follows:

  • I had two Crossbar instances. Both instances were running crossbar 20.8.1 with wsaccel 0.6.2.
  • One Crossbar instance was running just a router, while the other was running a configuration with a router and two proxy workers.
  • Each Crossbar instance ran on a relatively small AWS instance, a t3a.small. I mostly wanted to compare router CPU usage vs. proxy CPU usage, so the overall instance size didn’t seem like a big issue for me.
  • After the Crossbar instances were set up, I distributed 500 connections roughly equally between the two, about 250 connections each.
  • Each connection was a long-lived WAMP connection.
  • Every connection forwarded a message to another WAMP connection on the same Crossbar instance approximately once per second. There was some randomness here to ensure that the connections did not all send at the same time (a sketch of such a client follows below).
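
For reference, each client was conceptually similar to this sketch (not my actual test code; the client ids, topic names, URL and realm are simplified placeholders, and it assumes the forwarding happens via publish/subscribe):

# rough sketch of one long-lived connection in the load test; placeholder names
import asyncio
import random
from autobahn.asyncio.wamp import ApplicationSession, ApplicationRunner

NUM_CLIENTS = 500

class LongLivedClient(ApplicationSession):
    async def onJoin(self, details):
        me = random.randrange(NUM_CLIENTS)
        peer = (me + 1) % NUM_CLIENTS

        def on_message(payload):
            pass  # a real client would act on the forwarded message here

        # each connection listens on its own topic ...
        await self.subscribe(on_message, f"com.example.client.{me}")

        # ... and forwards a message to another connection roughly once per
        # second, with some jitter so the senders do not all fire at once
        while True:
            await asyncio.sleep(random.uniform(0.5, 1.5))
            self.publish(f"com.example.client.{peer}", "hello")

if __name__ == "__main__":
    ApplicationRunner("ws://crossbar.example.com:8090/ws", "realm1").run(LongLivedClient)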

When I ran this against my existing router-only configuration, CPU usage was pretty constant at around 15% to 20% of the two-core machine. When I enabled the proxy workers on one instance, CPU usage would spike to 90% to 95% while connections were being established, then fall to about 30% while the clients were only sending messages. In both phases, the CPU usage was much higher than the router-only 15% to 20%.

My configurations were as follows:

  • Router only:
{
    "version": 2,
    "controller": {},
    "workers": [
        {
            "type": "router",
            "realms": [
                {
                    "name": "realm1",
                    "roles": [
                        {
                            "name": "anonymous",
                            "permissions": [
                                {
                                    "uri": "",
                                    "match": "prefix",
                                    "allow": {
                                        "call": true,
                                        "register": true,
                                        "publish": true,
                                        "subscribe": true
                                    },
                                    "disclose": {
                                        "caller": false,
                                        "publisher": false
                                    },
                                    "cache": true
                                }
                            ]
                        }
                    ]
                }
            ],
            "transports": [
                {
                    "type": "websocket",
                    "endpoint": {
                        "type": "tcp",
                        "port": 8090
                    },
                    "options": {
                        "auto_ping_interval": 15000,
                        "auto_ping_timeout": 10000,
                        "auto_ping_size": 4
                    }
                }
            ]
        }
    ]
}
  • One router with two proxies:
{
    "version": 2,
    "controller": {},
    "workers": [
        {
            "type": "router",
            "realms": [
                {
                    "name": "realm1",
                    "roles": [
                        {
                            "name": "anonymous",
                            "permissions": [
                                {
                                    "uri": "",
                                    "match": "prefix",
                                    "allow": {
                                        "call": true,
                                        "register": true,
                                        "publish": true,
                                        "subscribe": true
                                    },
                                    "disclose": {
                                        "caller": false,
                                        "publisher": false
                                    },
                                    "cache": true
                                }
                            ]
                        }
                    ]
                }
            ],
            "transports": [
                {
                    "type": "rawsocket",
                    "endpoint": {
                        "type": "unix",
                        "path": "router.sock"
                    },
                    "serializers": ["cbor"],
                    "auth": {
                        "anonymous-proxy": {
                            "type": "static"
                        }
                    }
                }
            ]
        },
        {
            "type": "proxy",
            "options": {
            },
            "connections": {
                "conn1": {
                    "transport": {
                        "type": "rawsocket",
                        "endpoint": {
                            "type": "unix",
                            "path": "router.sock"
                        },
                        "url": "ws://localhost",
                        "serializer": "cbor"
                    }
                }
            },
            "routes": {
                "realm1": {
                    "anonymous": "conn1"
                }
            },
            "transports": [
                {
                    "type": "websocket",
                    "endpoint": {
                        "type": "tcp",
                        "port": 8090,
                        "shared": true,
                        "backlog": 1024
                    },
                    "options": {
                        "auto_ping_interval": 15000,
                        "auto_ping_timeout": 10000,
                        "auto_ping_size": 4
                    }
                }
            ]
        },
        {
            "type": "proxy",
            "options": {
            },
            "connections": {
                "conn1": {
                    "transport": {
                        "type": "rawsocket",
                        "endpoint": {
                            "type": "unix",
                            "path": "router.sock"
                        },
                        "url": "ws://localhost",
                        "serializer": "cbor"
                    }
                }
            },
            "routes": {
                "realm1": {
                    "anonymous": "conn1"
                }
            },
            "transports": [
                {
                    "type": "websocket",
                    "endpoint": {
                        "type": "tcp",
                        "port": 8090,
                        "shared": true,
                        "backlog": 1024
                    },
                    "options": {
                        "auto_ping_interval": 15000,
                        "auto_ping_timeout": 10000,
                        "auto_ping_size": 4
                    }
                }
            ]
        }
    ]
}

I was hoping that this would be a great way to scale Crossbar up and take advantage of more cores on a single instance, but the CPU usage seems to be much higher than just running the single router by itself.

Have I configured something incorrectly for the proxy workers? Do you have any suggestions for modifying the configuration to make it more performant, or a better match for the existing single-router configuration?

Thank you in advance!

I was hoping that this would be a great way to scale Crossbar up and take advantage of more cores on a single instance

Yes, that is exactly what proxy workers and rlinks provide - please see my post from August with the benchmark numbers. Unfortunately, I don’t have time to debug your setup. If you are really interested, we could certainly create a scaling PoC, on your own or public cloud infrastructure, and prove scale-up using your requirements/services.

cheers,
/Tobias

Thank you, @oberstet. I understand that you’re busy and may not have time to dig through the above configs. I appreciate all the help/feedback you’ve given so far. I hope you have a happy, relaxing holiday season!

If you, or anyone else on the forums, does find time: my main goal was to create a minimal configuration that supports two proxy workers and can handle long-lived WebSocket connections using pub/sub and broadcast messages. I thought my config was pretty bare beyond that support, so I’m confused as to why I’m seeing much higher CPU usage than in the tests posted earlier in this thread. Maybe those earlier tests did not focus as heavily on WebSockets?

I’ll continue to play around with the configuration, but if anyone has suggestions or tweaks that would make it more performant, please let me know.

Thank you,
Ryan

I tried to dig into this more today. I ran Austin on the Crossbar instance running the proxy workers while I was establishing about 50 WebSocket connections (in batches of 10), and then created a flame graph from the samples. It looks like the vast majority of the time is spent in _read_node_key, specifically when creating the pyqrcode instance: https://github.com/crossbario/crossbar/blob/5aa0682c6ef1edee8ede920873e16571bf4e68a6/crossbar/common/key.py#L123
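
In case it helps anyone reproduce this, a quick way to time just the QR rendering in isolation would be something like the following (a rough sketch, not taken from the Crossbar sources; the hex string is only a stand-in for the node public key that key.py renders):

# micro-benchmark of QR code rendering with pyqrcode; placeholder key material
import time
import pyqrcode

pubkey_hex = "ab" * 32  # stand-in for a hex-encoded 32-byte public key

N = 100
start = time.time()
for _ in range(N):
    # build the QR code and render it as a terminal string
    pyqrcode.create(pubkey_hex).terminal()
print(f"{(time.time() - start) / N * 1000.0:.2f} ms per QR code")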

I tried to read through all of the code leading up to this point, and I cannot identify anything I could change in my configuration that would keep the code from going down this path. This makes me think that for any WebSocket configuration using proxy workers, the proxy worker will consume a similar amount of CPU per core (~90% in my case) while connections are being established.

@oberstet, in the test example that you posted, I noticed you called out the number of WAMP calls/sec (150,000). Do you recall how many independent WAMP connections were established? And do you recall whether your profiling showed similarly high CPU usage during the phase when the clients were connecting?