C (fraction of clients involved in each round): Because Federated Learning typically involves a very large number of mobile devices, having every client participate in each round would be prohibitively expensive. Besides, the authors report diminishing returns when increasing C.
B (local mini-batch size) and E (local epochs): Together, these two knobs control (1) how much computation a client performs in one round and (2) how many times the local model gets updated. Remember that communication cost dominates in the federated setting, while computation on mobile devices is relatively “cheap”. Based on extensive experiments, the authors suggest it is usually beneficial to increase the amount of work each client does per round, as long as it is not overdone. They even suggest decreasing B, as long as the hardware’s parallelism is still fully exploited, so that the local model gets updated more frequently (see the sketch below).
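To make the roles of C, B, and E concrete, here is a minimal sketch of how the three knobs show up in client sampling and local training. It uses plain NumPy with a linear model and squared loss as a stand-in for the real networks, and all function names are illustrative rather than from the paper's code:

```python
import numpy as np

def sample_clients(K, C, rng=np.random.default_rng()):
    """Pick a random fraction C of the K clients for this round."""
    m = max(int(C * K), 1)                  # C: fraction of clients per round
    return rng.choice(K, size=m, replace=False)

def client_update(w, X, y, E, B, lr=0.1, rng=np.random.default_rng()):
    """Run E local epochs of mini-batch SGD (batch size B), starting from w."""
    w = w.copy()
    n = len(X)
    for _ in range(E):                      # E: number of local epochs
        idx = rng.permutation(n)
        for start in range(0, n, B):        # B: local mini-batch size
            batch = idx[start:start + B]
            grad = X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)
            w -= lr * grad
    return w
```

Larger E and smaller B both increase the number of local gradient steps per round, which is the “more work on clients per round” the authors recommend, up to the point where the local models drift too far apart.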
The most questionable part is that, for general non-convex loss functions, simply averaging two models offers no guarantee that the averaged model will be any better; in fact, it can be worse than either input model. This problem can be further exacerbated by the non-IID data used to train the different input models.
The left figure below shows an example where mixing two models 0.5/0.5 yields a higher loss than either input. However, that happens when the two models are initialized independently before training. In the right figure, the authors empirically show that if the two models start from the same initialization, the averaged model can outperform both, even when they are subsequently trained on disjoint data.
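A quick way to reproduce this kind of curve is to sweep a mixing weight θ and evaluate the loss of θ·w₁ + (1−θ)·w₂. Below is a minimal sketch in the same illustrative NumPy/squared-loss setting as above (not the paper's actual experiment):

```python
import numpy as np

def interpolation_losses(w1, w2, X, y, num_points=21):
    """Evaluate the loss of the mixed model theta*w1 + (1-theta)*w2
    for theta in [0, 1]. A dip in the middle of the curve means
    averaging helps; a bump means it hurts."""
    thetas = np.linspace(0.0, 1.0, num_points)
    losses = []
    for theta in thetas:
        w = theta * w1 + (1.0 - theta) * w2
        losses.append(np.mean((X @ w - y) ** 2))
    return thetas, np.array(losses)
```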
Going back to the FedAvg algorithm, each round can be seen as a miniature instance of the right figure below, because the server re-initializes all participating clients with the same global model. Even though the local models on different clients are likely to diverge after E local epochs, this empirical evidence gives some reassurance that averaging them can be beneficial.
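Putting the pieces together, one round proceeds as described in the paper: broadcast the current global model to the selected clients, let each train locally, then average the results weighted by local dataset size. The sketch below reuses the illustrative helpers defined earlier and is, again, only a toy NumPy rendition of that round:

```python
def server_round(w_global, clients, C, E, B):
    """One FedAvg round. Every selected client starts from the SAME
    w_global (the shared re-initialization), trains locally, and the
    server averages the results weighted by local dataset size."""
    selected = sample_clients(len(clients), C)
    local_weights, local_sizes = [], []
    for k in selected:
        X_k, y_k = clients[k]               # clients: list of (X, y) shards
        local_weights.append(client_update(w_global, X_k, y_k, E, B))
        local_sizes.append(len(X_k))
    total = sum(local_sizes)
    # Weighted average: w <- sum_k (n_k / n) * w_k
    return sum((n_k / total) * w_k
               for w_k, n_k in zip(local_weights, local_sizes))
```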
Most of the statements above are drawn from experimental results. The authors conducted experiments to study the effect of C, E, and B on image (MNIST) and language data.