I will attempt to give some guidance on the first topic. Trying to keep a TCP connection persistent is “very hard”. So, edge devices do have to handle re-connecting. The connection may go away for many reasons (including “router is re-started”, but that’s just one). Upon re-connection, state needs to be re-synchronized.
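To make the re-connect handling concrete, here is a minimal sketch of a retry loop with exponential backoff, followed by a re-sync step on every fresh session. The names `connect`, `resync` and `wait_closed` are hypothetical placeholders, not Crossbar or Autobahn APIs, and the delay values are arbitrary assumptions. Your client library may already ship its own re-connect logic, so check that before rolling your own.

```python
import random
import time


def backoff_delays(initial=1.0, factor=2.0, max_delay=60.0, jitter=0.1):
    """Yield an endless series of retry delays (seconds), capped at max_delay."""
    delay = initial
    while True:
        # Randomize slightly so a fleet of devices does not reconnect in lockstep.
        yield delay * (1.0 + random.uniform(-jitter, jitter))
        delay = min(delay * factor, max_delay)


def run_forever(connect, resync, sleep=time.sleep):
    """Keep a session alive: connect, re-synchronize state, wait, repeat.

    `connect` (hypothetical) returns a session object whose `wait_closed()`
    blocks until the connection is lost, and raises ConnectionError on failure.
    `resync` (hypothetical) rebuilds local state on a fresh session.
    """
    delays = backoff_delays()
    while True:
        try:
            session = connect()
        except ConnectionError:
            sleep(next(delays))      # router unreachable: wait, then retry
            continue
        delays = backoff_delays()    # reset the backoff after a successful connect
        resync(session)              # state must be rebuilt on every new session
        session.wait_closed()        # returns when the connection goes away
```

The important part is that `resync` runs on every new session, not just the first one: from the device's point of view, a re-connect after a router re-start looks the same as the initial connect.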
For “publish / subscribe” topics, you can have history (or just a “last published value”). If your application is designed such that this last value is the current state, that can be sufficient. If the state is “too large” to publish each time, you can subscribe to state updates and immediately call some sort of “get_current_state” RPC.
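The "subscribe, then immediately fetch the current state" approach has one subtlety: updates can arrive before the snapshot does. A rough sketch of one way to handle that is below. It assumes (and this is my assumption, not something WAMP gives you) that every update carries an increasing sequence number and that your `get_current_state` RPC returns the sequence number its snapshot is current as of. `StateMirror` and its method names are made up for illustration.

```python
class StateMirror:
    """Local copy of remote state, kept consistent across re-connects.

    Assumes (hypothetically) that each update carries a monotonically
    increasing sequence number and that the snapshot RPC reports the
    sequence number it is current as of.
    """

    def __init__(self):
        self.state = {}
        self.seq = None        # None until a snapshot has been loaded
        self._pending = []     # updates that arrived before the snapshot

    def on_update(self, seq, changes):
        """Event handler for the state-update topic."""
        if self.seq is None:
            self._pending.append((seq, changes))   # no snapshot yet: buffer
        elif seq > self.seq:
            self.state.update(changes)
            self.seq = seq
        # Otherwise the update is stale or a duplicate and is ignored.

    def load_snapshot(self, seq, state):
        """Apply the result of the snapshot RPC, then replay buffered updates."""
        self.state = dict(state)
        self.seq = seq
        pending, self._pending = self._pending, []
        for p_seq, changes in sorted(pending, key=lambda item: item[0]):
            self.on_update(p_seq, changes)

    def reset(self):
        """Call on disconnect so the next session starts by buffering again."""
        self.seq = None
        self._pending = []
```

On each (re-)connect you would call `reset()`, subscribe with `on_update` as the handler first, and only then call the RPC and pass its result to `load_snapshot`. Subscribing first means no update can fall into the gap between the snapshot and the subscription.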
See https://crossbar.io/docs/Event-History/ for more about event history. For just keeping the last event on a topic, see get_retained in the publish/subscribe options and also this example: https://github.com/crossbario/autobahn-python/tree/master/examples/twisted/wamp/pubsub/retained
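To illustrate what a retained event buys you, here is a toy in-memory model of the semantics. This is not how Crossbar implements it and `ToyBroker` is purely hypothetical, as is the topic name. With Autobahn, as far as I know, the publisher asks for retention via `retain=True` in its publish options and the subscriber opts in via `get_retained=True` in its subscribe options; the linked example is the authoritative reference.

```python
class ToyBroker:
    """In-memory model of retained events (not Crossbar's implementation)."""

    def __init__(self):
        self._subscribers = {}   # topic -> list of handler callables
        self._retained = {}      # topic -> last payload published with retain=True

    def publish(self, topic, payload, retain=False):
        if retain:
            self._retained[topic] = payload
        for handler in self._subscribers.get(topic, []):
            handler(payload)

    def subscribe(self, topic, handler, get_retained=False):
        self._subscribers.setdefault(topic, []).append(handler)
        # A late subscriber that asks for it gets the retained value right away.
        if get_retained and topic in self._retained:
            handler(self._retained[topic])


broker = ToyBroker()
# Published while the edge device is still offline.
broker.publish("com.example.valve.state", {"open": True}, retain=True)

received = []
# The device (re-)connects later and still learns the last value.
broker.subscribe("com.example.valve.state", received.append, get_retained=True)
print(received)   # [{'open': True}]
```

A subscriber that does not ask for the retained event simply waits for the next publication, which is the usual pub/sub behaviour.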
It is extremely unlikely that Crossbar will ever have “persistent TCP connections”, especially across re-starts of the router itself. This typically requires specialized hardware, and Crossbar keeps a lot of state besides the TCP connections themselves.
On the other topic (single point of failure): currently, a single realm runs in a single process. We are actively implementing “router to router links” that will allow a single logical realm to be served by multiple processes (on separate machines). Such failover will still rely on edge devices re-connecting (if the router process they’ve attached to goes down), but they will be able to do so immediately. Other pieces of the “scaling” picture are already implemented, such as “proxy workers”, which can offload much of the work (TLS termination, authentication, etc.) that the router process would otherwise do. These can already run in multiple processes.
So, to summarize: your design does need to anticipate re-connections, and Crossbar + WAMP provide some tools (including event history and retained events) to help with this. Scaling a single realm across multiple processes is coming soon, but will ultimately rely on re-connecting to divert traffic to other processes / machines. Additionally, there are existing features to help scale across multiple cores.
Hope this helps!