Luncheon Series: Deb Roy on "The Human Speechome Project"

This afternoon, the Berkman Center Luncheon Series kicked off the new year with Deb Roy, who directs the MIT Media Lab's Cognitive Machines group and chairs the Academic Program in Media Arts and Sciences.

Deb discussed "The Human Speechome Project", an effort to observe and computationally model the longitudinal course of one child's language development at an unprecedented scale. For more information on his talk, click here.

If you missed the chat, catch the podcast audio & video at MediaBerkman later this week. Also, a live transcript follows after the jump.

Disclaimer: Please note that this transcript is almost certainly incomplete and contains grammatical errors.

Goal of the project: Advance our understanding of how children acquire language. Modes: longitudinal, ultra-dense sampling rate, in vivo.

In this field, what's typical is that you bring a kid and mom into the lab and get a couple of hours of audio, one or two hours once or twice a month. That's a weak foundation for any kind of theorizing: sparse, incomplete data. If you observe children growing, things happen over the course of days, not just a couple of times a month.

We also wanted to minimize observer effects (people with tape recorders, etc., in the room). The initial thrust remains understanding child development, but one of the things we're now looking at is applying the technology to treat certain developmental disorders. There are various other possibilities, like video scrapbooking, parental aids, and retail behavior analysis. Interesting applications in those directions: if you own a retail storefront and have cameras, you can do what you want with that data. So there are a lot of possible directions of impact for the core technology beyond the original application.

Learning words from sights and sounds: if you build a machine that models, in some limited capacity, what the infant or child sees and hears, the learning machine essentially acts as a lens into the learning environment of the child and lets you test hypotheses about the semantics of speech. This system did in fact learn from child data. That got us thinking: if you bring a child and mom into a lab, they don't act naturally. What percentage of time does a child actually spend playing with toys? That gave rise to us looking for a new data set.

This is a picture of my house in Arlington. If you look at the ceiling, there's a yellow camera poking out: a high-fidelity, high-resolution camera with a privacy shutter. If you were to look through the camera going back about two years: above every light switch there's a device with an interface to the house. Mic, camera, diary notes. The "oops" button is an anti-TiVo button, a dialog manager that lets you erase back in time. Originally there were only two buttons. The fourth button is the diary note: if you press it, a flag is set in a backend database saying something interesting happened and someone in the house wanted to annotate it.

I'll now play you a little example of actual video, so you can get a sense of the quality of one channel. What we have done over the past 30 months is capture 80,000 hours of video and 120,000 hours of audio on a 200,000 GB (roughly 200 TB) disk array. In some sense you are looking at the world's first speechome: roughly 70-80% of my son's waking hours at home, from birth to age 2.5. This is a raw, unanalyzed data set, which raises the question of what we can mine from it.
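
As a back-of-envelope check on that scale (a sketch in Python, assuming the quoted figures and roughly uniform storage across channels; these are not the system's actual specs):

```python
# Rough arithmetic on the corpus quoted above: ~200,000 GB across
# 80,000 hours of video and 120,000 hours of audio.
video_hours, audio_hours = 80_000, 120_000
total_gb = 200_000
gb_per_hour = total_gb / (video_hours + audio_hours)   # ~1 GB per channel-hour
mbit_per_s = gb_per_hour * 8_000 / 3_600               # ~2.2 Mbit/s average rate
print(f"{gb_per_hour:.1f} GB/hour ~= {mbit_per_s:.1f} Mbit/s per channel")
```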

The first question is how you, as a human analyst, even get a sense of what's in the data. I worked on a project a few years ago to put the House and Senate online, and the total amount of video and audio there might be comparable. The issues of how you mine that and find the bit you want when you want it, whether for legislative proceedings or a person's identity and development, are quite connected.

For those not familiar, this is a spectrogram: a visualization of the frequency content of audio over time. You can see the lines: where is there speech and no speech? Is someone singing, talking? Is water running? It's widely used by people who do sound analysis. What about the equivalent for video? One of a handful of techniques: we analyze where there's movement and leave a trace, and as time scrolls by, the two of us are in a sort of dance, interacting, as we go through this. Each of these space-time worms captures one person's movement. This is over a minute of video, and you can read off certain things: there are two people, and they have some coordinated activity. If you put those two things together: there's some audio, a person in a third room, a person moved into the living room, there's speech, some water running.
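
Both visualizations come from standard signal-processing ideas. Here is a minimal sketch (assuming numpy/scipy; not the project's actual code) of how each trace could be computed: a spectrogram for the audio channel and a frame-difference motion-energy trace for video.

```python
import numpy as np
from scipy.signal import spectrogram

# --- Audio: spectrogram of a synthetic 1 kHz tone plus noise ---
fs = 16_000                                   # sample rate (Hz)
t = np.arange(fs * 2) / fs                    # 2 seconds of audio
audio = np.sin(2 * np.pi * 1000 * t) + 0.1 * np.random.randn(t.size)
freqs, times, power = spectrogram(audio, fs=fs, nperseg=512)
# power[f, t] is the energy at frequency bin f and time bin t:
# horizontal bands = sustained tones (singing, running water),
# vertical bursts = speech onsets.

# --- Video: motion energy from frame differencing ---
# Stand-in for real frames: a (num_frames, height, width) grayscale array.
frames = np.random.rand(100, 120, 160)
motion = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
# Thresholding `motion` marks when someone is moving; accumulating the
# per-pixel differences over time yields the "space-time worm" traces.
```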

Here now is a day: 24 channels, including a period where all the sensors are off. The software is called "TotalRecall". Space-time worms: you can see where people are and when they are talking.

Zoomed out to three months: you can only see when data was captured.

Data Analysis

Given the tools for visualizing, we're developing tools to analyze fine-grained details of the data. For the interaction between my son and me, imagine where I'm looking as a spotlight coming out of my head. As the spotlights shift in tight synchrony, there's speech. A moment of joint attention, where he follows my gaze and says "green ball": joint attention links the symbolic bits of speech to what we're both doing. He closes the loop and I give some reinforcement. Which, by the way, a child with autism may not exhibit at all. If you can characterize how these patterns shift over time, that's of great clinical interest.
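
To make the joint-attention idea concrete, here is a toy sketch: given per-frame gaze targets for two people (stand-in labels, since the lab's actual head-tracking output isn't shown here), flag the frames where both gazes converge on the same object while speech is active.

```python
import numpy as np

# Invented per-frame gaze targets and speech activity.
child_gaze  = np.array(["toy", "ball", "ball", "dad", "ball"])
parent_gaze = np.array(["ball", "ball", "ball", "ball", "toy"])
speech_on   = np.array([False, True, True, True, False])

# Joint attention: both looking at the same thing while someone speaks.
joint_attention = (child_gaze == parent_gaze) & speech_on
print(np.flatnonzero(joint_attention))   # -> [1 2], the "green ball" moments
```

Counting and timing such episodes over months is what would make the clinical comparison (e.g., for autism) possible.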

We developed BlitzScribe. We can take an hour of house time (all data) and transcribe it in 1 hour 40 minutes. We're transcribing everything my son said; the estimate is 16 million words of transcribed data. If you want to look at the fine-grained development of speech, you need that data, to see what's happening moment by moment. We've also developed a method using computer vision: it takes a computer-generated head, locks it onto the head in the video, and re-estimates its position, so the outcome is a relatively precise estimate of where the person is looking. Fast forward: imagine you own Sears and have a security camera, and you're curious whether your customers are looking at certain things. You can imagine why this would be of interest.
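
The speed of BlitzScribe comes from automatically chopping the audio into short utterance-sized snippets so the human only types and never hunts for speech. The published system uses a trained speech detector; the sketch below substitutes a crude energy threshold just to illustrate the segmentation step (all parameters are illustrative).

```python
import numpy as np

def segment_utterances(audio, fs, frame_ms=25, threshold=0.02, min_gap_ms=300):
    """Return (start, end) sample indices of speech-like regions."""
    frame = int(fs * frame_ms / 1000)
    n_frames = len(audio) // frame
    energy = np.array([np.sqrt(np.mean(audio[i*frame:(i+1)*frame] ** 2))
                       for i in range(n_frames)])
    active = energy > threshold
    gap_frames = min_gap_ms // frame_ms
    segments, start, gap = [], None, 0
    for i, is_speech in enumerate(active):
        if is_speech:
            if start is None:
                start = i               # open a new segment
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= gap_frames:       # silence long enough: close segment
                segments.append((start * frame, (i - gap + 1) * frame))
                start = None
    if start is not None:               # audio ended mid-utterance
        segments.append((start * frame, n_frames * frame))
    return segments
```

Each (start, end) pair becomes one snippet in the transcription queue, which is how an hour of house audio can be transcribed in well under two hours of human time.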

Cross-Situational Experience

Since we're now transcribing, we can do things like pull up every time my son said the word "ball" to get a sense of what he thinks the word means. Is he using it in a general or a narrow way, or is he using it correctly? Here's a walkthrough of 9 months of him saying "ball" in context. If you are a speech-language pathologist interested in the development of speech, an interesting time-lapse would be to hear the pattern of his developing speech.
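
Mechanically, a query like "every time he said ball" is just a filter over time-stamped transcripts. A hypothetical sketch (the file name and column names are invented, not the project's schema):

```python
import csv
from datetime import datetime

def find_word(transcript_csv, word):
    """Yield (timestamp, speaker, utterance) for rows containing `word`."""
    with open(transcript_csv, newline="") as f:
        for row in csv.DictReader(f):    # assumed columns: time, speaker, text
            if word in row["text"].lower().split():
                yield (datetime.fromisoformat(row["time"]),
                       row["speaker"], row["text"])

# Sorting by timestamp gives the 9-month "walk through" of contexts.
hits = sorted(find_word("transcripts.csv", "ball"))
```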

What you will hear is my son saying "water": his approximation was "gaga", and over the course of nearly a year he develops the proper pronunciation. It's not a linear evolution.

So, to summarize, what we're doing now is transcribing all speech heard and produced by him. What we want to do is trace the specific birth of words and phrases, in various ways starting with joint attention, looking at objects in the environment, and analyzing the role of social and physical factors.

Scaling the Speechome (N=1)

All of this, from a scientific perspective, is limited because we have one subject. How do you do this for larger numbers of people, and why would you want to? Over the last 6 months we've been working with the Director of Research at the Groden Center for Autism, because the specific kinds of cues that matter for understanding early language acquisition also matter for autism. The questions are: can we detect it earlier, and can we characterize the developmental trajectories people are on? All of these are open questions. The problem is that there's about 3,000 feet of concealed wiring in my house; everything is embedded, which is very expensive and difficult. So we've started thinking about a portable device: mic and camera with a base. In some sense we're scaling back, not going for nearly-24/7 recording but looking at key areas where a lot of interaction happens, and streaming out of the home into an analysis lab. We have a prototype at the Media Lab and are planning to deploy a pilot batch of these.

All sorts of questions get raised, not just about my own home data; this turns up the sensitivities, since we'll have other people's data coming into our lab. Families where the parents are headed for divorce (if a child has autism, there's a higher likelihood of divorce), and other kinds of data we're starting to capture. And also, there's huge motivation to get the data, so we can get a fix on what's going on.

In the retail space, one possibility: the whole premise of this data is cross-modal analysis. You have the context and the linguistic signal (speech), and we want to build machines that understand the relationship between the two. In retail you don't have speech, but you do have electronic transaction records: everything the bank teller is doing is being captured. And it's not just our project; there is an emerging set of technologies that will make it easier to connect human language and signals to context. A lot of our corporate sponsors see and understand that. That's another direction this work may well head.

Q: Your data sets are organic, and the computing that you're using is fairly large. When do you think the bank teller or retail store will have equivalent computing, to do this in real time?

Deb: It already is. The cost of our storage system was high 3 years ago; today it's feasible for cognitive science labs, and in a few years it will be a line item on your NIH budget. You could have done this 3 years ago if you were a bank or retailer. And because of 9/11, certain places have to archive video.

Q: You're separating out and analyzing?

Deb: Not a big deal.

Q: How heavyweight is the software?

Deb: The cost of computing is dropping and the density is increasing, and our algorithms are getting better. Seagate is a partner, and wants to take a subset of this and make it part of the firmware of the drive.

L: I'm curious - have you thought about comparative smaller projects? Comparative meaning: more families, culturally different families, a family in a different part of the world, a child who is an orphan, a single-parent household. Those would be four models where language acquisition could be radically different.

Deb: The way we characterize this project, in many ways we're trying to push the envelope: get a data set and engage the community. I have a daughter, and if you watch her trajectory, it's just totally different. Is it all genetic? Are there behavioral biases? There was a lot of skepticism about the technology from the child development community when we started, and one sign of success, one metric, would be to overcome that skepticism and get more people to do the things you're suggesting.

Comment: If you did more bite-size things, you could anonymize the data/person.

Deb: I think there's great value in doing longitudinal studies. There's a history of diary studies in the field, but they carry all sorts of theoretical biases: the diarist only notes what seems interesting. We're not susceptible to that.

Q: Let me first say, this totally rocks. Let me steer toward the privacy stuff. I look at this and my first conclusion is that this would be interesting to DHS. Have you thought through the implications of people looking at video in a different context? What if you have a person in a space who wants to do ill to people (for example, a person who first walks into a space and looks for surveillance cameras)? Increasingly in the US, and in the UK, it's quite common to have surveillance (just a panopticon effect: no one may be watching, but the camera is there and you know you're being recorded). With your tech, this could be a categorically different form of surveillance (who might be a person of interest in a scene). What have you thought through in terms of the nature of surveillance in this analysis?

Deb: A point of clarification: what we're doing is not real-time analysis. It's not just that the computers aren't fast enough; the whole method (speech transcription) involves humans, and the same goes for the video. State-of-the-art video analytics would not let you track head orientation unaided by a human; the piece I didn't show is a second layer of software with human analysis.

Is DHS interested in this? Yes, they have a huge program in video analysis and content extraction, with many research teams around the country focused on it.

Q: You're taking money from retailers, and this will likely be valuable to them. If and when this tool becomes useful for commercial apps, it will be deployed in other contexts, like security. How do you feel about that?

Deb: Type 1 and type 2 errors: you have to look at the larger context of what people are doing with the tools. You can watch behavior all you want, but you're not looking at cognition. Any time you do intention inference, there will be errors. I can't give a simple answer to how I feel about it.

Q: Following that, how accurate is your technology? Like where your son is saying water: has it been verified that he is saying water, or does someone filter that out?

Deb: Accuracy is a well-known problem for early language annotation. One of our first speech transcribers was a part-time nanny and part-time transcriber; unless you'd spent a lot of time watching the video, you wouldn't know what "gaga" meant. For the mature speech forms, a lot of the focus is more on "what did he hear, and in what context?" But cost is a real issue: it would typically cost a few million dollars to transcribe this data.

Q: I'd just like to hear you speculate some more on your view of a roadmap and the bottlenecks. It seems like human review is already the bottleneck: $120k for 16 million words. Are there any technologies you see for circumventing the human bottleneck?

Deb: Real time typically matters when you want to interact with systems; all of our robotics work is real-time. Everything I talked about today is offline. In terms of human bottlenecks, in tech terms that's automatic speech recognition. A lot of people say it's been solved, but they're wrong: if you feed a natural conversation into automatic speech recognition technology, you get single-digit accuracy. We have hit a wall and need research to get us unstuck. A stenographer or closed-captioner does it quite a bit faster than the technology.

Q: So what have you learned about human speech acquisition that you didn't know?

Deb: Nothing!

Q: Do you have a hypothesis?

Deb: I say nothing because the project has 3 phases: 1) data collection, 2) creating tools, 3) analysis. We're in the midst of phase 2. When you look through the literature on early language acquisition, there are many theories and very little data, and many times they make contradictory predictions. Hypotheses: look at the cues a child may be using to bias what they attend to and the meaning mappings they make. There are multiple competing objects/events in a scene, versus not. How important is each of these in giving you a leg up on learning? We have no idea how they interact. And other simple things: I think spatial co-location is really important. Are we moving? Is there a third person? We can systematically look at any combination of factors and at subsets (their predictive value for later productions).
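
As a sketch of what "predictive value" could look like computationally (factor names, data, and model are all invented for illustration; this is not the project's analysis):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# One row per (word, exposure) episode; columns are candidate factors:
# [caregiver frequency, spatial co-location, joint attention, third person]
X = rng.random((200, 4))
# Toy label "word later produced", driven mostly by co-location + joint attention.
y = (0.8 * X[:, 1] + 0.6 * X[:, 2] + 0.2 * rng.random(200)) > 0.8

model = LogisticRegression().fit(X, y)
factors = ["frequency", "co-location", "joint attention", "third person"]
for name, w in zip(factors, model.coef_[0]):
    print(f"{name:>15}: {w:+.2f}")   # larger weight = more predictive (in this toy)
```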

Q: I wanted to shift to some questions for the group as a whole. I think there are a lot of interesting questions about privacy and data ownership. What if you bought a new condo and it came wired like this? What steps would we need, as safe data practice, to have this happen? Releases for people who walk in your front door? Who has rights over which data?

Deb: We are talking to a toy manufacturer who may later introduce this into homes.

Q: I have written a lot about this in the context of hospital medical records. It has to do with the networking of information (what the doctor/patient record includes). We discovered that patients do not understand the difference between their info being correlated across visits and it being aggregated across different hospitals, something that needs to be voluntary.

Q: Back to Ethan's original question: if companies collect purchasing data, where do you draw the line?

Deb: Maybe even just keeping it in the domestic environment.

Q: What is it to be informed? Informed consent is different from the display of a camera reminding you that you are being recorded.

Deb: In case you're curious how we deal with this in our house: one of the things the IRB asked us to do was to post a placard where you enter our house, "everything is being recorded and may be posted on the internet". What we ended up doing: we took it down and made a convention in our house of turning recording off by default whenever anyone enters, and only friends and family who know us well give permission to record.

Q: Does your son have any inkling of the cameras? And how do you intend to tell him about it?

Q: He found the controllers, but then got bored of them. Is he aware?

Deb: No, I wouldn't say he's aware overall. We're not recording much anymore. It's not that there was a day we stopped recording; the density just dropped off.

Ethan: There's a sense that the name draws on the Human Genome Project. Whose genome are we sequencing? Are there any worries that idiosyncrasies in your son's speech will affect the development of this field? Craig Venter...

Deb: The ambiguity of whose genome it was comes from the popular media, as do the estimates of how much individual difference there is. When you look at diary studies, there aren't many detailed ones, but those that were done were influential in raising new questions. We will have to have larger sets of subjects; for now this is hypothesis-generating.

Q: Because of the way research funding operates, you seem to be looking at this as a business application. In what ways do you foresee this being used as a consumer application? Do you think it's feasible?

Deb: I would say that the project would not have happened if I had had to make a case for commercial viability to DHS or others.

Q: Do you see this being applied in a consumer setting first, because the privacy concerns will be bypassed?

Deb: I don't know. When you record something in a store, the data can be repurposed for analyzing buying habits.

Comment: Vegas uses that data very differently from retailers.

Comment: Another question - the crossover thing that Ethan mentioned. Congress goes to AT&T, and AT&T gives away records...

Q: In a slightly less sinister sense: we've all gotten used to a level of surveillance, a background panopticon effect where we know we're being watched. But this system goes further; in 5-10 years, suddenly a really different level of surveillance comes into play. The question: what's your reaction to the public-space implications? How will you handle moving through public spaces 10 years from now, when this is useful to commercial entities and DHS? (Comment: Facebook - but there was an active rebellion against Facebook.)

Q: People recording their lives - everyone will do it (Justin.tv), or so I've heard. Useful as an aid to memory.

Deb: The question is not whether you capture the data but what you do with it. My expectation is that expectations are tied to the intent of the person capturing the data. My suspicion is that if you've caused someone great embarrassment or loss, that person will come after you (it's about intent and purpose).