Inspecting Generalization Of Reinforced Learners: The Halma Benchmark