Slow write speed when multiple clients access HSDS

Hi,
I am trying to use HSDS as a data collector for multiple clients, i.e. multiple devices writing data to HSDS.
Each client writes data every 1s to its own domain (e.g. /clients/client0.h5).
The clients use h5pyd to open the domains and write to the arrays.

First I hosted HSDS as a Docker image on an Azure VM. The yml is configured with one service node (SN) and 8 data nodes (DN).
When one client starts writing data everything runs fast.
But as soon as additional clients start accessing their domains and writing, everything slows down a lot.
With one client uploading, a write takes about 0.02s; with 8 clients uploading, it varies between 5s and 8s for each.

Now I have switched to Kubernetes. I followed the instructions to install HSDS on Azure Kubernetes and also added Azure AD and Front Door.
I scaled HSDS to 30 pods. Everything works great for up to 5-8 clients (it varies from test run to test run).
Beyond that, there are at least two clients that take about 2s to write their data, and you can see the send/write time increase as the number of clients rises.

Can somebody relate and give me a tip? Where could this slowdown come from?
Am I missing something?

Out of curiosity, why do you use HSDS instead of Azure Event Hubs?

Hey,
Interesting use case!

There could be several different reasons for the slowdown… If you grep the logs, do you see any WARN or ERROR messages? When HSDS starts to throttle the load (by emitting 503 errors), you’ll see a WARN line in the log.

For Docker, if you run “docker stats”, do you see one or more of the HSDS containers with high CPU load? By default the runall.sh script creates just one SN container. Since all requests route through it, the SN container can get overwhelmed when there are too many clients. You can set up HSDS with multiple SN containers, but you’ll need a mechanism to distribute requests between the different ports (e.g. with 4 SNs, you’d have ports 5001, 5002, 5003, and 5004 by default).
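
For example, each client could pick “its” SN endpoint explicitly. This is just a sketch, assuming the default ports and a placeholder host name of “myvm”; h5pyd also honors the HS_ENDPOINT setting if you’d rather configure this per machine:

import h5pyd

# one endpoint per SN container started by runall.sh (placeholder host name)
endpoints = ["http://myvm:5001", "http://myvm:5002",
             "http://myvm:5003", "http://myvm:5004"]

client_id = 0  # e.g. derived from the device number
endpoint = endpoints[client_id % len(endpoints)]  # simple round-robin assignment

# open the client's own domain against the selected SN
f = h5pyd.File("/clients/client%d.h5" % client_id, "a", endpoint=endpoint)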

Kubernetes will load balance between the different HSDS pods, so I’d think that would work better with your scenario. Are you running the clients as Kubernetes pods or externally? If the latter, the Kubernetes ingress could be the bottleneck. Running clients as Kubernetes pods should work better as there is no ingress to contend with.

Lastly, if your clients and/or storage are on a different machine, check how much network I/O you are seeing. Obviously, once you get close to the bandwidth limits of the network, that will be the limiting factor.

Could you post a code sample for your client? I can try to replicate your setup.

John

Thank you so much for the quick and detailed reply.

With the Docker approach I had already suspected that the single SN is probably the bottleneck for all the requests.
Since the load balancing is handled by Kubernetes, I will focus on that.

In my testing, the clients are currently on different Azure VMs and continuously writing to four arrays (around 15,000 values in total) every second.
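
Roughly, each client’s write loop looks like this (simplified sketch; the dataset names and shapes are placeholders, and the real code reads values from the device instead of generating random data):

import time
import numpy as np
import h5pyd

f = h5pyd.File("/clients/client0.h5", "a")  # each client opens its own domain
dsets = [f["array%d" % i] for i in range(4)]  # the four arrays, ~15000 values in total

while True:
    start = time.time()
    for dset in dsets:
        values = np.random.rand(*dset.shape)  # placeholder for the device data
        dset[...] = values  # write the latest values (simplified)
    print("write took %.3fs" % (time.time() - start))
    time.sleep(1)  # one write cycle per second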

I’ve been playing with the number of pods and the node pool size in Kubernetes in the meantime. It seems like 30 pods is not enough. If I increase the pod count to 100 and distribute them across multiple machines in the node pool, I no longer see the slowdown. So I think I need to optimize my Kubernetes setup.

I thought of HSDS because of the data format and the ease of use in Python.
That’s why I wanted to try it out and see if it fits my application.
Thanks for the suggestion, I will also have a look at Azure Event Hubs.


Your Kubernetes investigation is interesting. One HSDS pod can use at most two cores’ worth of CPU, so if you are getting better performance with more VMs in your cluster (with lower utilization per VM), it seems like the bottleneck is something other than CPU or memory. Network bandwidth scales with the number of VMs, so it would be worthwhile to look at the network I/O usage on the VMs and see if that is what is maxed out.

This is not strictly related to performance, but is your client extending the size of the array with each write? There’s a bit of a race condition with multiple writers. Consider this code:

num_rows = 15000
extent = dset.shape[0]                      # current size of the first dimension
dset.resize(extent + num_rows, axis=0)      # increase the size of the first dimension
dset[extent:extent + num_rows] = more_data  # write to the new rows

This works fine with one writer, but if you have multiple writers, you’ll run into problems with the additional rows not actually getting added and/or data getting overwritten, depending on the order in which requests get processed by HSDS. There’s a dset.append() operation in h5pyd that does an atomic extend and write, but it’s only supported for 1-D datasets.
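
For a 1-D dataset, the append version would look something like this (sketch; it assumes the dataset was created as extendable, e.g. with maxshape=(None,)):

import numpy as np
import h5pyd

f = h5pyd.File("/clients/client0.h5", "a")
# assumes "samples" was created as a 1-D extendable dataset, e.g.
# f.create_dataset("samples", shape=(0,), maxshape=(None,), dtype="f8")
dset = f["samples"]

more_data = np.random.rand(15000)
dset.append(more_data)  # extends the dataset and writes the new rows in one request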