4. Components of the Corpus

The corpus consists of teaching materials, classroom interactions, and student artifacts. The current release consists mainly of transcripts of classroom audio and video recordings made in more than 350 primary and secondary schools in Singapore. The lessons coded and recorded to date break down as follows:

Table 1. Statistics of Lessons Coded and Recorded
Subject         Level     Stream    Units  Lessons  Duration (hr)
English         P5        EM1           9       37           34.3
                          EM1/EM2       0        0            0.0
                          EM2          13       66           58.6
                          EM2/EM3       0        0            0.0
                          EM3           8       46           43.0
                S3        EXP          17       65           62.1
                          EXP/SPE       0        0            0.0
                          NA            6       21           18.7
                          NT            8       30           25.0
                          SPE           3       10            8.7
Mathematics     P5        EM1           8       23           20.1
                          EM1/EM2       1        4            3.6
                          EM2          14       46           44.1
                          EM2/EM3       0        0            0.0
                          EM3           3       20           16.9
                S3        EXP          17       58           54.9
                          EXP/SPE       1        3            2.8
                          NA            5       20           17.5
                          NT           11       42           31.0
                          SPE           2        6            5.2
Science         P5        EM1           6       26           18.9
                          EM1/EM2       0        0            0.0
                          EM2          16       46           31.2
                          EM2/EM3       1        2            1.8
                          EM3           2        6            3.8
                S3        EXP          15       49           38.9
                          EXP/SPE       0        0            0.0
                          NA            6       17           14.2
                          NT            5       13           12.5
                          SPE           3       16           12.9
Social Studies  P5        EM1           5       11            6.3
                          EM1/EM2       1        4            3.6
                          EM2          16       37           21.8
                          EM2/EM3       2        5            3.1
                          EM3           4       10            5.8
                S3        EXP          18       62           52.4
                          EXP/SPE       0        0            0.0
                          NA            9       30           24.5
                          NT            0        0            0.0
                          SPE           2        7            8.1
Tamil           P5        EM1           2        4            4.1
                          EM1/EM2      11       50           36.7
                          EM2           3       12           11.6
                          EM2/EM3       0        0            0.0
                          EM3           3       15            7.7
                S3        EXP           1       24           23.5
                          EXP/SPE       7        2            2.5
                          NA            2        5            5.5
                          NT            2        6            3.6
                          SPE           2        5            5.8
Malay           P5        EM1           2       12           10.4
                          EM1/EM2       3       13           11.8
                          EM2           8       30           24.6
                          EM2/EM3       0        0            0.0
                          EM3           3       11            7.7
                S3        EXP           8       26           24.1
                          EXP/SPE       0        0            0.0
                          NA            4       12           10.1
                          NT            3        8            5.8
                          SPE           2        6            5.7
Chinese         P5        EM1           7       23           25.5
                          EM1/EM2       0        0            0.0
                          EM2          10       36           30.6
                          EM2/EM3       0        0            0.0
                          EM3          14       10            7.1
                S3        EXP           9       29           22.6
                          EXP/SPE       0        0            0.0
                          NA            3        8            6.8
                          NT            3        8            5.5
                          SPE           2        2            2.9
TOTAL           2 Levels  10 Types    351     1195         1008.4

At present, 277 transcripts in the SCoRE corpus have been processed and annotated; recordings of another 248 lessons have been transcribed and are now being cleaned and annotated. Based on the 277 annotated lessons, one hour of recording yields about 5,000 words on average, so the classroom component of the SCoRE corpus, at roughly 1,008 recorded hours (Table 1), will contain about 5 million words.
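
The size estimate above is simple arithmetic: total recorded hours multiplied by the average number of words per hour. A minimal Python sketch of that calculation follows, taking the 1008.4 hours from the TOTAL row of Table 1 and the 5000 words per hour quoted in the text. The function and variable names are illustrative only and are not part of any SCoRE tooling.

```python
# Figures quoted in the text and in Table 1; names are illustrative only.
TOTAL_HOURS = 1008.4      # total recorded duration (Table 1, TOTAL row)
WORDS_PER_HOUR = 5000     # average derived from the 277 annotated lessons


def estimate_words(hours: float, words_per_hour: float) -> float:
    """Estimate the word count of a set of recordings from their duration."""
    return hours * words_per_hour


estimated = estimate_words(TOTAL_HOURS, WORDS_PER_HOUR)
print(f"Estimated classroom component: {estimated / 1e6:.2f} million words")
```

The result is slightly above 5 million words, which matches the rounded figure given in the text.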

In addition to classroom discourse, the SCoRE project is also collecting the relevant teaching materials (e.g. textbooks and handouts) and student artifacts from the corresponding classes. With these included, the SCoRE corpus is expected to reach approximately 50 million words in total.
