Ivan Sysoev -> What would you like to spell

What would you like to spell?

MIT Media Lab, 2018-2019
Roles: Project lead; Concept; Design; Development; Research
Collaborators: James Gray, Sneha Makini, Susan Fine. Advisor: Deb Roy

Children acquire the basics of literacy at the age when playful learning is particularly important. We created an early literacy app (shown in the picture below) to build upon the appeal of open-ended, expressive play. In this app, children can make whatever words they like, then compose scenes out of these words and the images associated with them. To produce optimal learning while remaining scalable, we incorporated automatic scaffolding into the system. The scaffolding system guides children through building the words that they choose to make. But this requires the system to "know" what these words are.

In this project, we investigated five mechanisms for doing that: word bank, speech recognition, text recognition, network of semantic associations, and invented spelling interpretation.

Design

Word bank (WB)

The word bank is a simple collection of words arranged under categories. We selected words to align with children's widespread interests frequently observed in previous studies, such as their names and cartoon characters. We also considered which objects can function well together: e.g. items from the categories "family" and "things at home" can be used to create home scenes.

Speech recognition (ASR)

Speech recognition allows the child to directly tell the system what s/he would like to make. But since the ASR on children's voices is not very robust, we complemented it with a UI feature: the system shows a list of candidate words, and the child can select one by tapping on it. The long-eared fox on the recognition screen helps children to see the status of the system (e.g. processing, success, failure), and provides a backstory helping to explain recognition errors (the fox is old - it has a beard - and thus it cannot hear very well). Previous research showed that such backstories help children to be more patient about the imperfections of learning systems.
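As a rough illustration of the candidate list described above, here is a minimal sketch. It assumes the recognizer returns several transcripts with confidence scores and that the app has a vocabulary of buildable words; the function name, the input format and the limit are hypothetical and not taken from the actual system.

```python
import re

def candidate_words(hypotheses, vocabulary, limit=4):
    """Turn recognizer hypotheses into a short list of words to tap on.

    hypotheses: list of (transcript, confidence) pairs (assumed format).
    vocabulary: set of lowercase words the app knows how to build.
    """
    best = {}
    for transcript, confidence in hypotheses:
        # Children may speak in whole sentences, so scan every word.
        for word in re.findall(r"[a-z']+", transcript.lower()):
            if word in vocabulary:
                best[word] = max(best.get(word, 0.0), confidence)
    # Highest confidence first; ties broken alphabetically for stability.
    ranked = sorted(best, key=lambda w: (-best[w], w))
    return ranked[:limit]

hypotheses = [
    ("mr fox please give me a gorilla", 0.6),
    ("mister fox please give me a vanilla", 0.3),
]
shown = candidate_words(hypotheses, {"fox", "gorilla", "vanilla"})
```

Showing every plausible word, rather than only the top hypothesis, is what lets the child recover from a misrecognition with a single tap.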

Text recognition (OCR)

Text recognition allows children to "grab" interesting words from their environments and books.

Semantic association network (SAN)

The semantic association network allows the child to see words related to other words in the scenes they built. For instance, the picture below shows the associations to the word "dagger" that popped up after tapping on this item in the scene depicting two ninjas. The child can explore the network of associations in depth by tapping on these icons: tapping on an association shows the associations for that word in turn. Tapping on an association will also reveal a button for spelling it.
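The tap-to-explore behaviour can be pictured as walking a graph of words. The sketch below is only an illustration: the association lists are hypothetical sample data, and the function names are not from the actual app.

```python
# Hypothetical sample data: the real network's contents are not listed here.
ASSOCIATIONS = {
    "DAGGER": ["SWORD", "KNIFE", "NINJA"],
    "SWORD": ["WARRIOR", "SHIELD", "DAGGER"],
    "WARRIOR": ["HERO", "SOLDIER", "SWORD"],
}

def associations(word):
    """Words shown when the child taps on `word` (empty if unknown)."""
    return ASSOCIATIONS.get(word.upper(), [])

def follow_taps(start, taps):
    """Follow a sequence of taps and return every word visited, in order.

    A tap is ignored if that word is not among the associations
    currently on screen.
    """
    visited = [start.upper()]
    for tap in taps:
        tap = tap.upper()
        if tap in associations(visited[-1]):
            visited.append(tap)
    return visited

path = follow_taps("dagger", ["sword", "warrior", "unicorn", "hero"])
```

In this sample, the tap on "unicorn" is ignored because it is not among the words shown for WARRIOR, so the walk continues from WARRIOR to HERO.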

Invented spelling interpretation (ISI)

Invented spelling is a fascinating phenomenon in which children come up with their own ways to spell words, based on their developing understanding of the words' phonetic structure and the relationships between sounds and letters. For instance, "fish" could be spelled as FES (where the name of the letter E stands for its sound), and "cat" could be spelled as KT (omitting the medial vowel and representing the sound [k] as K). Our system interprets such spellings and suggests candidate words that they might represent.
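To make the idea concrete, here is a minimal sketch of one way such an interpreter could rank candidates. It is not the method used in our system: it assumes a crude, hypothetical letter-to-sound-class table (all vowels collapse to one symbol, C and K are treated alike) and a weighted edit distance in which a missing vowel costs less than other mismatches.

```python
# Hypothetical table: letters that young spellers often interchange share a
# symbol; every vowel collapses to "V". A real system would use phonemes.
SOUND_CLASS = {**{v: "V" for v in "AEIOUY"}, "C": "K", "Q": "K", "Z": "S"}

def to_classes(spelling):
    return "".join(SOUND_CLASS.get(ch, ch) for ch in spelling.upper())

def distance(invented, target, vowel_gap=0.5):
    """Edit distance in which skipping a vowel is cheaper than other edits."""
    a, b = to_classes(invented), to_classes(target)
    gap = lambda ch: vowel_gap if ch == "V" else 1.0
    prev = [0.0]
    for cb in b:
        prev.append(prev[-1] + gap(cb))
    for ca in a:
        cur = [prev[0] + gap(ca)]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + gap(ca),                          # extra letter typed
                cur[j - 1] + gap(cb),                       # letter left out
                prev[j - 1] + (0.0 if ca == cb else 1.0),   # match or mismatch
            ))
        prev = cur
    return prev[-1]

def candidates(invented, lexicon, k=3):
    """The k lexicon words closest to the invented spelling."""
    return sorted(lexicon, key=lambda w: (distance(invented, w), w))[:k]

top = candidates("KT", ["CAT", "DOG", "FISH", "KITE", "BOAT"])
```

With this toy table, KT is closest to CAT because the only difference is one omitted vowel, which mirrors the example in the text.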

Findings

Evaluation of the input mechanisms was part of a study examining a child-driven, machine-guided approach to early literacy learning. It took place at a public charter school in the Boston area. 25 children used SpeechBlocks in their class throughout the semester. The reader can find more details about this study and its findings in our Computers & Education paper.

Usage patterns

Though children were able to use both scaffolded and non-scaffolded modes, about 85% of words were constructed with the help of scaffolding. Children had difficulties using the invented spelling interpreter, which as a result contributed only ~6% of words. The fractions of words originating from semantic associations (~21.5%), speech recognition (~20%), the word bank (~18%) and text recognition (~16.5%) were approximately equal. Different children used these mechanisms in very different proportions, but the approximately equal overall usage of the four suggests that each of them was valuable in its own way.
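A quick check of how close the four shares are, using only the approximate percentages reported above (the variable names are just for illustration):

```python
# Approximate shares of words per input mechanism, in percent (from the text).
shares = {
    "semantic associations": 21.5,
    "speech recognition": 20.0,
    "word bank": 18.0,
    "text recognition": 16.5,
}
invented_spelling = 6.0

mean_share = sum(shares.values()) / len(shares)                    # 19.0
max_deviation = max(abs(v - mean_share) for v in shares.values())  # 2.5
gap_to_invented = mean_share - invented_spelling                   # 13.0
```

Each of the four mechanisms is within 2.5 percentage points of their 19% average, whereas invented spelling sits 13 points below it.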

Inputs' roles

Here is an example of scene construction using different input mechanisms. At first, the child built two ninjas and said, “They are father and son. They are practicing”. She then expressed a desire to give them weapons and used invented spelling recognition to create SOD (sword). Then she resorted to speech recognition to build SHIELD. Afterwards, she tapped on the sword to see the related words, picked DAGGER and gave it to the small ninja. This was followed by a long exploration of the semantic association network, until she stumbled upon the word PRISONER. This discovery prompted her to exclaim, “I’m going to make a villain to fight them!”, which led to the complete scene.

This sequence is illustrative of the roles that were generally served by different input mechanisms. We saw three main roles:

Deliver words that the player had in mind. This role was fulfilled particularly well by speech recognition. In our example, the child wanted to build SHIELD, and was able to accomplish it by using ASR. Children also resorted to speech recognition when they saw something interesting built by their peers, and wanted to make something similar. For example, one child built a scene showing panthers encroaching on some tigers, and showed it to a friend. The friend liked the idea and used ASR to build his own panther attacking a giraffe. The first child then “one-upped” her friend by using ASR to build a crocodile that, according to her, devoured all the other animals.

Help the player come up with ideas. This role was fulfilled particularly well by the association network. In the ninja example, the child appeared to run out of ideas and resorted to the associations. She went through a long sequence starting with SWORD: SWORD -> WARRIOR -> HERO -> BATMAN -> DRAGON -> UNICORN -> CENTAUR -> GOBLIN -> WARRIOR -> HERO -> SOLDIER -> PRISONER. Stumbling upon PRISONER gave her a new idea of what to build, one she was very excited about.

Be a fallback option. The "high-tech" modes of input, such as speech recognition or text recognition, were far from being perfectly reliable. On occasions when they failed, children could resort to using the word bank, which was simple and robust, while still having plenty of interesting words to build.

Speech recognition was valuable

We were originally concerned that, due to the low quality of existing ASR systems on children's voices, the speech recognition input would be unusable. But children ended up using it successfully, partly because of the decision to show multiple recognition candidates and partly because of their own persistence. When the ASR failed to understand them, they patiently repeated their request up to 6 times. This persistence also indirectly indicates the value of the system to them.

Two interesting things were observed with respect to ASR usage. First, children tended to speak to the system in sentences, e.g. "Mr. Fox, please give me a GORILLA". Second, they treated Mr. Fox (the avatar of the system) as a living character: talking to him politely, asking him to do things (“Mr. Fox, what time it is?”, “Mr. Fox, finish off (turn off) the tablet!”), asking us about his abilities ("Can he fly?"), encouraging him when ASR failed to deliver a correct result (“Oh no! Mr. Fox, we need you! Mr. Fox, spell FIRE!”) and appreciating his work when the result was correct ("Mr. Fox, you busted it out!" (apparently in the sense of "produced"), "Fox, I love you"). These observations may be useful in designing interfaces for open-ended speech interaction for children.

Invented spelling interpretation was hard to use

We only saw one child who consistently and purposefully built words using this method. For example, she used the spelling SR to make STAR and HR to make CHAIR (note how she used the name of the letter H, [eɪtʃ], to represent the sound [tʃ]). Other children typically just arranged blocks randomly in the input box until they stumbled by chance upon an output that they liked.

Here is some speculation as to why this might have been the case. First, it appears that many children didn't have sufficient phonological skills to identify the needed sounds. For example, during the demo period, a child couldn't identify the initial sound in BATMAN, saying "BATMAN starts with BATMAN". When asked to identify the last sound in BOAT, he answered with the first sound ([b]), even when asked to try again. Second, children rarely removed blocks from the word box, even when they tried out multiple blocks in search of the correct sound. For instance, while trying to spell BOMB, a child first represented the sound [ɑ] as A, then O, and ended up with BAO. Such scenarios led the interpreter astray. Finally, children occasionally placed suitable blocks, but in the wrong order. The knowledge of spelling direction is not always firmly established at this age.

Given the proliferation of invented spellings in our studies with older children, it is possible that this input method would work much better for older ages.

Text recognition was problematic

Though children were excited about using text recognition, it was often associated with frustration, confusion and distraction. First, there were various issues with picking up words: (1) children holding the camera sideways relative to the text, or holding it too close to focus, or putting the tablet on top of the text; (2) players having a poor grip and shaking the tablet, which disrupted the camera's focus; (3) reflective surfaces producing glare; (4) child-oriented materials often containing highly stylized and decorative fonts; (5) words written by teachers on the board being underlined or smudged.

Second, children eagerly spelled the results of OCR errors, such as: CALE (for CUTE), DRAGO (for DRAGON), RZONT, FADER, LOORM, HEELS (for WHEELS), SOO, ODA. Since pre-literate children don't know what words actually "say", that might lead to confusion.

Finally, OCR was quite distracting for some. The most notable issue was "picture-taking". The user interface of OCR had a freeze button, intended to allow children to freeze the picture, put down the bulky tablet and pick the words from the snapshot. However, some children used this button to take portraits of their friends. Another unintended activity was to use the interface like a camera obscura, looking around through it. Some children were also interested in just pointing at words around the classroom, waiting until they turned green and moving on. Still another activity was to explore the texts by hearing the words, but never attempting to spell them. While the last activity might have some literacy value, it is disconnected from the primary purpose of the system. These distractions were likely the reason why, in the week OCR was introduced, the number of spelled words per child fell nearly 20% compared to the previous week, and never returned to that earlier level.

Publications

Sysoev, I. (2020). Digital Expressive Media for Supporting Early Literacy through Child-Driven, Scaffolded Play. (Chapters 3 and 6.5) Doctoral dissertation, MIT Media Lab. pdf
