1. W=constant initialization

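every neuron in a layer computes the same output → every neuron gets the same gradient, so the weights stay identical after each update (symmetry is never broken)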
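
The symmetry problem can be checked numerically. This is a minimal sketch, assuming a hypothetical two-layer tanh network with a squared-error loss; the layer sizes, the constant 0.5, and the random data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
Din, H, Dout = 4, 3, 2                 # hypothetical layer sizes
x = rng.standard_normal((5, Din))      # 5 made-up input examples
y = rng.standard_normal((5, Dout))     # made-up targets

# Constant initialization: every weight has the same value.
W1 = np.full((Din, H), 0.5)
W2 = np.full((H, Dout), 0.5)

# Forward pass.
h = np.tanh(x @ W1)                    # hidden activations
out = h @ W2

# Backward pass for the loss 0.5 * sum((out - y)**2).
dout = out - y
dW2 = h.T @ dout
dh = dout @ W2.T
dW1 = x.T @ (dh * (1 - h**2))          # tanh'(z) = 1 - tanh(z)**2

# Every hidden unit produces the same activation and gets the same gradient.
hidden_identical = np.allclose(h, h[:, :1])
grad_cols_identical = np.allclose(dW1, dW1[:, :1])
print(hidden_identical, grad_cols_identical)   # prints: True True
```

Since all columns of dW1 are equal, a gradient step keeps all columns of W1 equal, so the hidden units never learn different features.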

2. Random numbers (small, big)

W_small = 0.01 * np.random.randn(Din, Dout)
W_big = 1 * np.random.randn(Din, Dout)

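okay for small networks, but problems with deeper networks: with small weights the activations shrink toward zero layer by layer; with big weights saturating units such as tanh get pushed to ±1; in both cases the gradients become tiny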
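
One way to see the deep-network problem is to track activation statistics through a stack of layers. This is a rough sketch, assuming tanh layers; the helper `last_layer_stats`, the depth of 6, the width of 512, and the 0.99 saturation threshold are arbitrary choices for illustration.

```python
import numpy as np

def last_layer_stats(scale, num_layers=6, dim=512, seed=0):
    """Push random inputs through tanh layers whose weights are
    scale * randn, and return (std of the last activations,
    fraction of last activations with |value| > 0.99)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((16, dim))
    for _ in range(num_layers):
        W = scale * rng.standard_normal((dim, dim))
        x = np.tanh(x @ W)
    return x.std(), np.mean(np.abs(x) > 0.99)

small_std, small_sat = last_layer_stats(0.01)   # same scale as W_small
big_std, big_sat = last_layer_stats(1.0)        # same scale as W_big

print("small: std=%.2e  saturated=%.2f" % (small_std, small_sat))
print("big:   std=%.2e  saturated=%.2f" % (big_std, big_sat))
```

With the small scale the last-layer std collapses toward zero, so the weight gradients (which are proportional to the incoming activations) collapse as well. With the big scale most units sit near ±1, where the local tanh gradient is close to zero.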

W = np.random.randn(Din, Dout) / np.sqrt(Din)  # Xavier initialization: std = 1 / sqrt(Din)
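
Why divide by sqrt(Din): for zero-mean, independent inputs and weights, each pre-activation is a sum of Din terms, so its variance is Din * Var(x) * Var(W). Choosing Var(W) = 1/Din keeps the variance of the output roughly equal to that of the input. The sketch below checks this on a stack of tanh layers; the depth of 6 and the width of 512 are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_layers = 512, 6               # hypothetical width and depth
x = rng.standard_normal((16, dim))     # made-up input batch

stds = []
for _ in range(num_layers):
    # Scale by 1 / sqrt(Din); here Din == dim for every layer.
    W = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    x = np.tanh(x @ W)
    stds.append(x.std())

# Activation std per layer: it decays slowly (tanh squashes a little)
# but neither collapses to zero nor saturates.
print(["%.2f" % s for s in stds])
```

This is only a check on activation scale for tanh; the variance argument assumes the nonlinearity is roughly linear around zero.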
