output will be the same → gradient will be the same, so all neurons get identical updates (no symmetry breaking)
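A minimal sketch of the symmetry problem, assuming a tiny two-layer tanh network with constant-initialized weights. The sizes, the constant 0.5, and the squared-error loss are illustrative assumptions, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
N, Din, H = 8, 4, 3                      # hypothetical batch, input, hidden sizes
x = rng.standard_normal((N, Din))
y = rng.standard_normal((N, 1))

# Constant initialization: every hidden unit starts with identical weights.
W1 = np.full((Din, H), 0.5)
W2 = np.full((H, 1), 0.5)

# Forward pass
h = np.tanh(x @ W1)                      # (N, H) hidden activations
pred = h @ W2                            # (N, 1) predictions

# Backward pass for loss = 0.5 * mean((pred - y)**2)
dpred = (pred - y) / N
dW2 = h.T @ dpred
dh = dpred @ W2.T
dW1 = x.T @ (dh * (1 - h**2))            # tanh'(z) = 1 - tanh(z)**2

# Every hidden unit produces the same output and receives the same gradient.
same_outputs = np.allclose(h, h[:, :1])
same_grads = np.allclose(dW1, dW1[:, :1])
print(same_outputs, same_grads)
```

Since each column of dW1 is identical, a gradient step keeps the columns of W1 identical, and the hidden units stay copies of each other.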
W_small = 0.01 * np.random.randn(Din, Dout)
W_big = 1 * np.random.randn(Din, Dout)
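W_small is okay for small networks, but causes problems in deeper networks: activations shrink toward zero layer by layer, so the weight gradients vanish too. W_big saturates nonlinearities such as tanh, which also kills the gradient.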
W = np.random.randn(Din, Dout) / np.sqrt(Din)
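A rough sketch comparing how the activation spread evolves with depth under the three scalings. The tanh nonlinearity, the 6 layers of width 512, and the helper name activation_stds are assumptions made for illustration:

```python
import numpy as np

def activation_stds(scale_fn, num_layers=6, dim=512, batch=256, seed=0):
    """Return the std of tanh activations after each layer (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, dim))
    stds = []
    for _ in range(num_layers):
        # Gaussian weights scaled by a factor that may depend on the fan-in.
        W = scale_fn(dim) * rng.standard_normal((dim, dim))
        x = np.tanh(x @ W)
        stds.append(float(x.std()))
    return stds

small = activation_stds(lambda Din: 0.01)                 # like W_small
big = activation_stds(lambda Din: 1.0)                    # like W_big
xavier = activation_stds(lambda Din: 1.0 / np.sqrt(Din))  # divide by sqrt(Din)

for name, stds in [("small", small), ("big", big), ("xavier", xavier)]:
    print(name, [round(s, 4) for s in stds])
```

Under these assumptions the small scale should drive the activation std toward zero with depth, the large scale should pin it near 1 (saturated tanh), and the 1/sqrt(Din) scaling should keep it in a moderate range.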