Abstract. To determine which statistical tests can validly be applied to data that describe a temporal relationship between two or more repetitive movements by an animal, we empirically evaluated seven two-sample tests that seemed potentially useful: Student's t test, the Watson-Williams test for means, the variance ratio F test, the Watson-Williams test for the concentration parameter k, the Wallraff test, the Mann-Whitney test and the Watson U² test. Evaluations were carried out on the timing (phases) of bursts of muscular activity in one leg relative to those in another during free walking in cockroaches. Each statistical test was evaluated by randomly dividing a single parent set of data into two subsets, each containing about half of the original data. This division was repeated 400 times, generating 400 pairs of subsets. Each statistical test was applied separately to every pair of subsets to test the null hypothesis that the two samples of the pair came from the same population; this procedure generated 400 statistics for each test, one for each pair. The reliability of each statistical test was estimated by comparing the number of times the test actually indicated a significant difference between subsets with the number of times it would be expected to do so by chance (20 out of 400 at the 5% level of significance). This procedure was repeated on ten different sets of data. The outcome of the evaluation suggested that, from an empirical point of view, Student's t, the Mann-Whitney, the Wallraff and the Watson U² tests may be useful in assessing differences among the data we analyzed. The variance ratio F test and the Watson-Williams test for the concentration parameter k were clearly not usable. The Watson-Williams test for means might be useful in some circumstances. An arcsine transformation of the data did not significantly alter these results. 
Possible causes of the inapplicability of some of these tests to phase data are discussed.
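
The random-split evaluation described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names (`mann_whitney_p`, `rejection_count`), the synthetic phase values and the random seed are all assumptions introduced here. The Mann-Whitney test is implemented with the large-sample normal approximation and no tie correction, and phases are treated as ordinary numbers between 0 and 1 rather than as circular quantities.

```python
import math
import random


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney p-value (normal approximation, no tie correction)."""
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    n = len(pooled)
    rank_sum_x = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            if pooled[k][1] == 0:
                rank_sum_x += midrank
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2.0))


def rejection_count(data, test_p, n_splits=400, alpha=0.05, seed=0):
    """Count how often test_p rejects when one data set is split at random."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    half = len(data) // 2
    rejections = 0
    for _ in range(n_splits):
        shuffled = list(data)
        rng.shuffle(shuffled)
        # Both halves come from the same parent set, so the null is true.
        if test_p(shuffled[:half], shuffled[half:]) < alpha:
            rejections += 1
    return rejections


if __name__ == "__main__":
    # Synthetic stand-in for one parent set of phase values (hypothetical data).
    gen = random.Random(1)
    phases = [gen.gauss(0.5, 0.05) for _ in range(200)]
    observed = rejection_count(phases, mann_whitney_p)
    expected = 0.05 * 400
    print(f"rejections: {observed} of 400 (expected about {expected:.0f})")
```

A test that behaves reliably on a given data set should yield a rejection count close to 20; a count far above or below that figure signals that the test's assumptions do not hold for those data. Any other two-sample test can be substituted by passing a different function that returns a p-value.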