AudioImager

From Berkman Klein Google Summer of Code Wiki
*http://www.cmusphinx.sourceforge.net
*http://www.voxforge.org/
*http://www.kaltura.org
*http://www.linux.com/archive/feature/134671


Latest revision as of 13:26, 8 April 2010

Use of the software

  1. The user provides an audio or video file to the software.
  2. The software analyses the audio of the submitted file.
  3. The software creates a timeline of the file and places the keywords it has extracted from the audio on the timeline.
  4. The user can play the audio, examine the keywords, correct them and move them to the right place on the timeline.
  5. After the user has accepted the keywords, the software finds pictures with matching tags and keywords. The pictures can be retrieved from online sources or from a local hard drive.
  6. The software suggests a matching picture, and if there are other pictures with matching tags, it presents those as alternatives.
  7. The user can keep the default photo, change it to one of the other suggested photos or ask the software to get more photos to choose from. At this point the user is given a chance to change the keywords, and the software retrieves new pictures after a keyword change. The user can also choose to have a black screen or the original video instead of the photos.
  8. When the user has approved the photos, the system suggests photo transitions.
  9. The user can tweak the transitions and submit the project for video rendering.
  10. The software composes a video and end credits that include the relevant licensing information.
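
Steps 3, 4 and 8 above revolve around a timeline of keywords that the user can correct. A minimal sketch of such a structure follows. The names `Keyword`, `Timeline`, `correct` and `slots` are hypothetical and not part of any existing AudioImager code, and the sketch assumes that each picture stays on screen until the next keyword starts.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Keyword:
    time_s: float  # position on the timeline, in seconds
    text: str      # keyword extracted from the audio or typed by the user


@dataclass
class Timeline:
    duration_s: float
    keywords: list[Keyword] = field(default_factory=list)

    def add(self, time_s: float, text: str) -> None:
        """Place a keyword on the timeline, keeping entries sorted by time."""
        if not 0 <= time_s <= self.duration_s:
            raise ValueError("keyword lies outside the audio")
        self.keywords.append(Keyword(time_s, text))
        self.keywords.sort(key=lambda k: k.time_s)

    def correct(self, index: int, text: Optional[str] = None,
                time_s: Optional[float] = None) -> None:
        """Let the user fix a keyword's wording and/or move it in time."""
        kw = self.keywords[index]
        if text is not None:
            kw.text = text
        if time_s is not None:
            kw.time_s = time_s
            self.keywords.sort(key=lambda k: k.time_s)

    def delete(self, index: int) -> None:
        """Remove a wrongly recognised keyword."""
        del self.keywords[index]

    def slots(self) -> list[tuple[float, float, str]]:
        """Return (start, end, keyword) spans, one per picture to show."""
        spans = []
        for i, kw in enumerate(self.keywords):
            if i + 1 < len(self.keywords):
                end = self.keywords[i + 1].time_s
            else:
                end = self.duration_s
            spans.append((kw.time_s, end, kw.text))
        return spans
```

The spans returned by `slots()` are what a photo retrieval and transition step could consume: one picture per span, with a transition at each boundary.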

Technology

The software uses several technologies to tag the audio with keywords. Speech recognition is the primary tagging method. However, current speech recognition systems have deficiencies and cannot tag audio with 100% accuracy. This is why the software will let the user delete and correct the machine-created tags. One of the programmer's tasks will be to test and choose the open source voice recognition software best suited for AudioImager.
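
As an illustration of the tagging step, the sketch below turns recognised words into keyword candidates. It assumes, hypothetically, that the chosen recogniser can deliver a list of `(start_time_in_seconds, word)` pairs. The function name, the small stopword list and the `min_length` threshold are illustrative choices, not features of any particular speech recognition package.

```python
import string

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = frozenset(
    {"the", "a", "an", "and", "of", "to", "in", "is", "our", "we", "that", "this"}
)


def extract_keywords(recognized, stopwords=STOPWORDS, min_length=4):
    """Pick keyword candidates from (start_s, word) pairs.

    Words are lowercased and stripped of surrounding punctuation. Short
    words, stopwords and words already used as a keyword are skipped.
    """
    keywords = []
    seen = set()
    for start_s, word in recognized:
        token = word.strip(string.punctuation).lower()
        if len(token) < min_length or token in stopwords or token in seen:
            continue
        seen.add(token)
        keywords.append((start_s, token))
    return keywords
```

Because recognition errors will pass through a filter like this unchanged, the candidates are only a starting point for the manual correction described above.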

The software can either be implemented as a web service or as a standalone application.

The editing could be done, for example, locally with a Java application or in a browser with JavaScript.


Media and licensing

There are close to 90 million photos on Flickr that are licensed with Creative Commons licenses which permit the creation of derivative works. Some of the licenses require that adapted works be licensed under similar terms. Some permit use of the photos only in non-commercial projects. The user must be able to tell the software the intended use of the video. For example, if the video is intended for commercial use, the software should only use pictures that permit such use. The software should also make sure that it does not accidentally mix works with incompatible licenses.
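
The filtering described above could be sketched as follows. This is a simplified model under stated assumptions: licenses are identified by the usual Creative Commons short codes (`by`, `by-sa`, `by-nc`, `by-nd`, `by-nc-sa`, `by-nc-nd`), the video counts as a derivative work of every photo it uses, and compatibility is reduced to two rules about ShareAlike. The `Photo` class and both function names are hypothetical.

```python
from dataclasses import dataclass

# Licenses without the "nd" (NoDerivatives) element.
DERIVATIVES_ALLOWED = {"by", "by-sa", "by-nc", "by-nc-sa"}
# Licenses without the "nc" (NonCommercial) element.
COMMERCIAL_ALLOWED = {"by", "by-sa", "by-nd"}


@dataclass(frozen=True)
class Photo:
    title: str
    author: str
    license: str  # Creative Commons short code, e.g. "by-sa"


def usable(photo: Photo, commercial: bool) -> bool:
    """True if the photo may be adapted for the intended use of the video."""
    if photo.license not in DERIVATIVES_ALLOWED:
        return False
    return not commercial or photo.license in COMMERCIAL_ALLOWED


def compatible(photos) -> bool:
    """Simplified check that the selected photos can share one video."""
    licenses = {p.license for p in photos}
    share_alike = {code for code in licenses if code.endswith("-sa")}
    if len(share_alike) > 1:
        return False  # by-sa and by-nc-sa each demand their own terms
    if "by-sa" in share_alike and any("nc" in c.split("-") for c in licenses):
        return False  # a by-sa video cannot include non-commercial material
    return True
```

A retrieval step could call `usable` on every candidate before showing it to the user, and `compatible` on the whole selection before rendering.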

One interesting use case could be making a music video. However, the music should be either the user's own or, preferably, CC-licensed. Many of the ideas for tagging music videos have involved getting the lyrics from online services. Those services typically have lyrics only for popular music, which does not come with suitable licenses.

Another use case could be to take lectures or political speeches, such as the "State of the Union", and illustrate them with images. Voice recognition software is much more accurate when there is no background music.

Tasks

  • Evaluate the existing open source technology for voice recognition and video editing.
  • Evaluate which licenses the existing software uses and which license AudioImager should use.
  • Create use-cases for the software.
  • Create a mock-up user interface that facilitates all the use cases.
  • Implement the voice recognition tagging feature.
  • Implement the manual tagging feature.
  • Implement the photo retrieval feature.
  • Implement the video editing feature.
  • Implement the end credit feature.
  • Document all features.
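
For the end credit task, a minimal sketch is given below. It assumes each photo used in the video is known as a `(title, author, license_code)` tuple in order of appearance. The function name and the wording of the credit line are illustrative only, and a real implementation would need to follow the attribution requirements of each license.

```python
def end_credits(photos):
    """Build end-credit text from (title, author, license_code) tuples.

    Photos are listed in order of first appearance, and a photo that is
    used several times is credited only once.
    """
    lines = ["Photo credits"]
    seen = set()
    for title, author, code in photos:
        entry = (title, author, code)
        if entry in seen:
            continue
        seen.add(entry)
        lines.append(f'"{title}" by {author}, licensed under CC {code.upper()}')
    return "\n".join(lines)
```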

Nice to have features

  • Web interface for using the software.
  • Real-time photo retrieval to match what is being said by a live presenter (human/machine tagging).
  • A feature to upload the finished video to YouTube, Flickr and Facebook.

Here is an example of what the software output could be: http://video.google.com/videoplay?docid=8060206257543341917#