microservices ... need architecture help, detach call/background, work queue

#1

I want to run tasks in the background using a worker pool. See a previous response I made to another poster about this:

https://groups.google.com/d/msg/crossbario/eBjFez16SXA/orcfY8rWBwAJ

To do this today, I’m using beanstalkd as a task queue, and I inject jobs into it in various ways, most often through crontab. The worker processes are managed by supervisor in a pool of about 10 workers; each waits for jobs on the task queue, pops one off, processes it, then exits. A fresh worker is then spawned by supervisor and waits for the next job. I’ve been having workers die and respawn in order to ensure that memory is cleared and any new code is reread from the filesystem. This may no longer be necessary, as these workers appear quite stable these days.
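For reference, the supervisor half of that setup is only a few lines of supervisord config (the program name, command, and path here are made up):

```ini
[program:worker]
; pool of 10 identical workers (program name and command are illustrative)
command=php /srv/app/worker.php
numprocs=10
process_name=%(program_name)s_%(process_num)02d
; respawn each worker after it exits, i.e. one job per process lifetime
autorestart=true
```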

I’m torn between wanting one powerful app vs. several smaller apps that work together. Although the combo of cron + crossbar + supervisord + beanstalkd is “working” right now, I’d like a cleaner and simpler design if possible. As my architecture and APIs grow, I want fewer moving parts and less complexity.

So, I came across this documentation on scaling microservices with crossbar:

https://github.com/crossbario/crossbar-examples/tree/master/scaling-microservices

It looks like crossbar will allow me to register multiple clients for the same RPC calls and I can configure:

  • shared registrations,

  • concurrency, and

  • call queuing
These features sound like possible replacements for beanstalkd?! So, here are some of my thoughts:

  • if I write a client in PHP using thruway, I’m not sure if it will allow concurrent RPC calls to the same client … meaning I don’t know if I can run calls in parallel or if PHP being single-threaded will end up serializing all the calls internally. I think because they are using reactphp it SHOULD allow calls in parallel to the same client … but I have to test and experiment with this … if any of you have experience, you might save me a lot of time and trouble. For instance, if I have an RPC call named “generate_complex_report” and this function takes 100 seconds to run (a lot of I/O wait) … if I call this function from crossbar 3 times rapidly, will it take 300 seconds to complete all 3 calls or will they run in parallel and only take 100 seconds?

  • Let’s say I’m wanting to run up to 10 workers in parallel … should I be connecting 10 separate clients with shared registrations and concurrency=1 or 1 client without shared registrations and concurrency=10?

  • when jobs are running from my generic workers through supervisord, in order to call specific functions I have to pass the function name and parameters into the worker via the beanstalkd queue. This isn’t difficult, since I just populate a JSON struct, insert it into the queue, and pop the same JSON struct off on the other side. But as I do this work, I feel very silly, because ‘function name’ + args is very much the WAMP protocol … it feels like reinventing something that already exists inside crossbar! Additionally, I end up with a split architecture where my “worker API” is separated from my “crossbar API” … and I’d like to marry these 2 back under a single umbrella.

  • If I just want a worker to process something in the background and I don’t need to wait for the results, how can I detach from a process with crossbar? In my crontab scenario, I have some CLI tools that ‘inject’ jobs into the queue at scheduled intervals. These injections could be replaced by WAMP RPC calls that get detached and are call queued.

So, before I run off and try to build all of this under crossbar using the microservices feature described in the link above, can some of you comment on my thoughts above and tell me if I have the right idea, or if this isn’t really what’s intended for this feature?
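To make the queue-envelope point concrete: the ‘function name’ + args struct I described is a few lines of JSON packing and dispatch, which is indeed more or less what a WAMP CALL message carries (the field names below are made up):

```python
import json

def make_job(proc, *args, **kwargs):
    """Producer side: pack a 'function name + args' envelope for the queue."""
    return json.dumps({"proc": proc, "args": list(args), "kwargs": kwargs})

def run_job(payload, registry):
    """Worker side: decode the envelope and dispatch to a registered function."""
    job = json.loads(payload)
    return registry[job["proc"]](*job["args"], **job["kwargs"])

# The same shape a WAMP CALL carries: procedure URI plus args/kwargs.
registry = {"import_data": lambda location_id: f"imported {location_id}"}
print(run_job(make_job("import_data", 42), registry))  # imported 42
```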

Some background jobs I’d be running are something like …

  • import data from 500+ remote locations every 5 minutes
  • run data aggregation for reporting
  • export data and send to 3rd party at given intervals
  • test remote location connectivity
For importing data, let’s imagine we have a crossbar RPC call like:

import_data(location_id)

and a wrapper rpc call that might do all the locations at once:

import_all_locations()

Now, in cron, we connect to crossbar and call ‘import_all_locations’, and that function fetches a list of the locations from the database and, for each location, calls import_data(location_id) with the given location_id. Suddenly, I’ve just injected 500+ API calls into the crossbar system. I don’t really need to wait for the output of these functions, because I can just have each one publish a message to a given topic once it completes or fails. Is this “detaching” from a background process possible in crossbar? Can I use it this way?
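The fan-out itself can also be bounded so the 500+ calls don’t all run at once. A plain asyncio sketch of that idea, with a stand-in function in place of the actual WAMP call (function bodies here are hypothetical):

```python
import asyncio

async def import_data(location_id):
    # Stand-in for the routed call, e.g. session.call("import_data", location_id).
    await asyncio.sleep(0.01)
    return f"ok {location_id}"

async def import_all_locations(location_ids, max_in_flight=10):
    # Bound concurrency so hundreds of calls don't flood the workers at once.
    sem = asyncio.Semaphore(max_in_flight)

    async def one(loc):
        async with sem:
            return await import_data(loc)

    return await asyncio.gather(*(one(loc) for loc in location_ids))

results = asyncio.run(import_all_locations(range(20)))
print(len(results))  # 20
```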

– Dante


#2

Hi Dante,

    - if I write a client in PHP using thruway, I'm not sure if it will
    allow concurrent RPC calls to the same client ... meaning I don't know if I
    can run calls in parallel or if PHP being single-threaded will end up
    serializing all the calls internally. I think because they are using
    reactphp it SHOULD allow calls in parallel to the same client ... but I

There is a difference between running things concurrently and running them in parallel.

A non-blocking run-time can run many concurrent tasks even using 1 thread/process. However, such a run-time cannot really run things in _parallel_, because that requires multiple threads.

E.g., to make use of threads in an AutobahnPython/Twisted-based client, you would use deferToThread() to push the load to background threads.

    have to test and experiment with this ... if any of you have experience,
    you might save me a lot of time and trouble. For instance, if I have an
    RPC call named "generate_complex_report" and this function takes 100
    seconds to run (a lot of I/O wait) ... if I call this function from
    crossbar 3 times rapidly, will it take 300 seconds to complete all 3 calls
    or will they run in parallel and only take 100 seconds?
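The 100-seconds scenario can be simulated with a plain asyncio sketch (no WAMP involved; sleep stands in for the I/O wait): three I/O-bound tasks on one thread finish in roughly the time of the longest one, not the sum:

```python
import asyncio
import time

async def generate_complex_report(n):
    # Stand-in for an I/O-bound call: yields the (single) thread while waiting.
    await asyncio.sleep(0.1)
    return f"report {n}"

async def main():
    start = time.monotonic()
    # Three concurrent calls on one thread: roughly 0.1 s total, not 0.3 s.
    await asyncio.gather(*(generate_complex_report(i) for i in range(3)))
    return time.monotonic() - start

elapsed = asyncio.run(main())
print(f"elapsed: {elapsed:.2f}s")
```

That is concurrency on one thread; getting true parallelism for CPU-bound work still requires multiple threads or processes.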

    - Let's say I'm wanting to run up to 10 workers in parallel ... should I
    be connecting 10 separate clients with shared registrations and
    concurrency=1 or 1 client without shared registrations and concurrency=10?

If the workload is CPU bound, and the client run-time doesn't use threads, then use separate clients and concurrency 1.

If the workload is CPU bound, and the client run-time does use threads, then use 1 client per host machine with concurrency = number of threads you can make use of (max #CPU cores).

If the workload is IO bound (hence, the client will spend most of the time waiting for IO), then use 1 client and no max concurrency (or set the max concurrency to what your IO subsystem can handle - the max number of concurrent IOs in flight).

In essence, it really depends on the kind of workload: CPU or IO bound.
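To illustrate the IO-bound case with plain stdlib Python (sleep stands in for I/O wait and releases the GIL the way real I/O does, so threads help here; for CPU-bound Python work they would not):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def io_bound_job(location_id):
    time.sleep(0.1)  # stands in for I/O wait; releases the GIL while sleeping
    return location_id

start = time.monotonic()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(io_bound_job, range(10)))
elapsed = time.monotonic() - start
# 10 I/O-bound jobs across 10 threads: roughly 0.1 s wall-clock, not 1 s.
print(f"{len(results)} jobs in {elapsed:.2f}s")
```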

    - If I just want a worker to process something in the background and I
    don't need to wait for the results, with crossbar, how can I detach from a
    process? In my crontab scenario, I have some CLI tools that 'inject' jobs

You can't currently. If the caller detaches, all outstanding calls will (well, should) be canceled, because there is no one to receive the results anymore.

    into the queue at scheduled intervals. These injections could be replaced
    by WAMP rpc calls that get detached and are allowed to be call queued.

    So, before I run off and try to build all of this under crossbar using the
    micro services feature as described in the link above, can some of you
    comment on my thoughts above and tell me if I have the right idea or if
    this isn't really what's intended for this feature?

    Some background jobs I'd be running are something like ...

    - import data from 500+ remote locations every 5 minutes
    - run data aggregation for reporting
    - export data and send to 3rd party at given intervals
    - test remote location connectivity

    For importing data, let's imagine we have a crossbar RPC call like:

        import_data(location_id)

    and a wrapper rpc call that might do all the locations at once:

        import_all_locations()

    Now, in cron, we connect to crossbar and call 'import_all_locations' and
    that function fetches a list of the locations from database and for each
    location calls import_data(location_id) with the given location_id.
    Suddenly, I've just injected 500+ API calls into the crossbar system. I
    don't really need to wait for the output of these functions because I can
    just have each one publish a message to a given topic once they complete or
    fail. Is this "detaching" from a background process possible in crossbar?

So you do want to collect the results in the end?

Then why don't you want the caller to receive it?

Publishing to a topic, just to collect the results from there again, seems more complicated ...

Not sure I get the whole picture of what you are after.

However, Crossbar.io is designed primarily as a message router - not as a job queuing system, and not as a job scheduling system. You could implement such things on top of Crossbar.io, and we might even add one or another such feature to CB, but ...
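E.g., one way to build "detached" jobs on top of plain routed RPC is to have the callee return a job id immediately, run the work in a background task, and publish the outcome to a topic when done. A pure-asyncio stand-in for that pattern (the WAMP session and topic are hypothetical, replaced here by a plain list):

```python
import asyncio
import itertools

job_ids = itertools.count(1)
published = []  # stand-in for session.publish("jobs.finished", ...)

async def import_data(location_id):
    await asyncio.sleep(0.05)  # the actual long-running work
    return f"imported {location_id}"

async def start_import(location_id):
    """Registered procedure: kick off the work, return a job id immediately."""
    job_id = next(job_ids)

    async def run():
        result = await import_data(location_id)
        # On completion, publish instead of returning to a (long-gone) caller.
        published.append({"job_id": job_id, "result": result})

    asyncio.create_task(run())
    return job_id

async def main():
    jid = await start_import(7)   # returns right away ...
    await asyncio.sleep(0.1)      # ... while the work finishes in background
    return jid, published

jid, events = asyncio.run(main())
print(jid, events)
```

The caller only ever waits for the (instant) job id; anyone interested in results subscribes to the completion topic.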

Maybe this is of interest to you - it compares Celery and Crossbar.io:

https://www.youtube.com/watch?v=WijEe0Vkj3Y

Cheers,
/Tobias
