Computer Vision for Human-Computer Interaction
Lecturer: Prof. Stiefelhagen and others
- Human-machine / human-robot interaction
- Vision is essential for interacting with users
- Also perception/interpretation of gestures, affect, facial expressions, …
- Smart Houses / Assisted Living
- Smart Cars (gaze direction, attention)
- Smart Rooms (who is where in a meeting, meeting analysis)
- Smart Homes (e.g. monitoring patients during sleep)
- Assistive technology for blind people
ML Basics
- Bayesian Classification
- Gaussian Mixture Models
- Expectation Maximization
- Linear Discriminant Functions
- Perceptrons
- SVMs
- Soft Margin
- Kernel Trick
- Linear SVMs (much faster, can perform well on high-dimensional data, too)
- Multi-Class SVMs
- Tips: normalization is important, parameter selection is important (grid search)
- k-means
- Agglomerative Hierarchical Clustering
Model Taxonomy
- Parametric
- Makes more assumptions about the form of the distribution
- Non-parametric
- Needs more data
- Generative
- Model the joint distribution p(x, y), e.g. as p(x | y) p(y)
- Can generate new data
- Can be converted to discriminative model with Bayes’ Rule
- Discriminative
- Only model the posterior p(y | x)
- Generally outperform generative models in classification tasks
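The Bayes' rule conversion mentioned above can be sketched as follows. This is a minimal illustration that assumes one-dimensional Gaussian class-conditional densities; the function names and the two-class parameters are hypothetical and not taken from the lecture.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior(x, priors, means, stds):
    """Turn a generative model p(x | y), p(y) into p(y | x) via Bayes' rule."""
    joint = [p * gaussian_pdf(x, m, s) for p, m, s in zip(priors, means, stds)]
    evidence = sum(joint)  # p(x) = sum over classes of p(x | y) p(y)
    return [j / evidence for j in joint]

# Hypothetical two-class model: equal priors, unit variance, means at -1 and +1.
probs = posterior(0.5, priors=[0.5, 0.5], means=[-1.0, 1.0], stds=[1.0, 1.0])
```

Classifying by the largest posterior then gives a discriminative decision rule derived from the generative model.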
Dimensionality Reduction
- Curse of dimensionality
Receiver operating characteristic (ROC)
Principal Component Analysis (PCA)
- Find principal components (= eigenvectors of covariance matrix) of data
- Project data onto the k eigenvectors with the largest eigenvalues (k = number of dimensions you want to keep)
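The two PCA steps above can be sketched with NumPy. This is a minimal version assuming the data fits in memory as an (n_samples, n_features) array; the synthetic data and the name `pca` are illustrative only.

```python
import numpy as np

def pca(X, k):
    """Return (mean, components, projected) for X of shape (n_samples, n_features)."""
    mean = X.mean(axis=0)
    Xc = X - mean                              # center the data
    cov = np.cov(Xc, rowvar=False)             # (n_features, n_features)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices, ascending order
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest eigenvalues
    components = eigvecs[:, order]             # columns are principal components
    return mean, components, Xc @ components

# Synthetic data with very different variance per axis (assumption for the demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.1])
mean, components, Z = pca(X, k=2)
```

Here `Z` holds the 2-D coordinates of each sample in the subspace spanned by the two strongest components.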
Linear Discriminant Analysis (LDA)
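Like PCA, but supervised: uses class labels to find projections that maximize between-class separation relative to within-class scatter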
Face Detection
Color Based Face Detection
Idea: find image parts with skin color, fit face to biggest cluster/elliptic cluster/heuristics, …
- Advantages:
- fast
- rotation/scale invariant
- occlusion resistant
- Disadvantages:
- color varies wildly with illumination
- cannot distinguish faces from other body parts
- confused by skin-colored objects
Color Spaces
- Class Y Spaces
- e.g. YUV
- Y contains brightness
- Perceptually uniform spaces
- e.g. CIE-Lab
- Euclidean distance can be used for color comparison
- Chromatic Color Spaces
- e.g. HS (from HSV), UV (from YUV), normalized RG (from RGB)
- may be more robust to lighting
Histogram (non-parametric)
Build histograms for skin and non-skin colors from training data.
Histogram Backprojection
= Compare color of a single pixel with the histograms
- Fast and simple
- Performs poorly with multi-modal distributions (possibly e.g. widely varying skin tones)
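A minimal sketch of per-pixel backprojection, assuming each pixel is given by two chroma values in [0, 1] (for example normalized r and g). The bin count, the decision rule (skin likelihood greater than non-skin likelihood), and the toy training data are assumptions for illustration.

```python
import numpy as np

BINS = 16  # hypothetical quantization per chroma channel

def chroma_histogram(pixels, bins=BINS):
    """Normalized 2-D histogram of (n, 2) chroma samples with values in [0, 1]."""
    hist, _, _ = np.histogram2d(pixels[:, 0], pixels[:, 1],
                                bins=bins, range=[[0, 1], [0, 1]])
    return hist / max(hist.sum(), 1)

def backproject(image, skin_hist, nonskin_hist, bins=BINS):
    """Boolean mask for an (H, W, 2) chroma image: True where skin is more likely."""
    idx = np.clip((image * bins).astype(int), 0, bins - 1)
    p_skin = skin_hist[idx[..., 0], idx[..., 1]]
    p_nonskin = nonskin_hist[idx[..., 0], idx[..., 1]]
    return p_skin > p_nonskin

# Toy training data: one skin-like chroma value and one background value.
skin_hist = chroma_histogram(np.tile([0.45, 0.30], (50, 1)))
nonskin_hist = chroma_histogram(np.tile([0.20, 0.70], (50, 1)))
image = np.array([[[0.45, 0.30], [0.20, 0.70]]])  # 1 x 2 pixel image
mask = backproject(image, skin_hist, nonskin_hist)
```

Each pixel needs only two table lookups, which is why the method is fast.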
Histogram Matching
= Compare color histogram of an image patch with the histograms (e.g. Bhattacharyya distance)
- Better performance than backprojection
- works with multi-modal distributions
- slower
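As a sketch of the patch-level comparison, the function below computes one common form of the Bhattacharyya distance, sqrt(1 - BC), where BC is the Bhattacharyya coefficient. Another common definition is -ln(BC); which one the lecture uses is not stated, so this choice is an assumption.

```python
import numpy as np

def bhattacharyya_distance(h1, h2):
    """Distance in [0, 1] between two histograms (any shape, non-negative counts)."""
    p = h1 / h1.sum()                      # normalize to probability distributions
    q = h2 / h2.sum()
    bc = np.sum(np.sqrt(p * q))            # Bhattacharyya coefficient
    return float(np.sqrt(max(0.0, 1.0 - bc)))

patch_hist = np.array([4.0, 3.0, 2.0, 1.0])
skin_hist = np.array([8.0, 6.0, 4.0, 2.0])     # same shape as patch_hist after normalization
other_hist = np.array([0.0, 0.0, 0.0, 5.0])
d_same = bhattacharyya_distance(patch_hist, skin_hist)
d_diff = bhattacharyya_distance(patch_hist, other_hist)
```

A patch would be labeled as skin if its distance to the skin histogram is smaller than its distance to the non-skin histogram.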
Perceptual Grouping
- Morphological Operators: Erosion and Dilation
- Erosion, then dilation = opening
- Dilation, then erosion = closing
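The operators above can be sketched for binary masks with a 3x3 square structuring element. The structuring element and the zero padding at the image border are assumptions; other shapes work the same way.

```python
import numpy as np

def _shifted_views(mask):
    """Yield the nine 3x3-neighborhood shifts of a boolean mask (zero padded)."""
    h, w = mask.shape
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    for dy in range(3):
        for dx in range(3):
            yield padded[dy:dy + h, dx:dx + w]

def dilate(mask):
    """A pixel is set if ANY pixel in its 3x3 neighborhood is set."""
    out = np.zeros_like(mask)
    for view in _shifted_views(mask):
        out |= view
    return out

def erode(mask):
    """A pixel stays set only if ALL pixels in its 3x3 neighborhood are set."""
    out = np.ones_like(mask)
    for view in _shifted_views(mask):
        out &= view
    return out

def opening(mask):
    return dilate(erode(mask))   # removes small isolated blobs

def closing(mask):
    return erode(dilate(mask))   # fills small holes
```

On a skin mask, opening removes speckle noise and closing fills gaps inside a face region before clusters are fitted.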
What works best?
- Bayesian (needs lots of memory), MLP, 3D Gaussian Skin+Nonskin
- Color space largely irrelevant, but results degrade with only chrominance channels
Neural Network Based Face Detection (Rowley, Baluja & Kanade)
- Sliding window (20x20) on different scales of picture
- Preprocessing: subtract a best-fit linear function to correct lighting
- Apply ANN on windows as “filter for faces”
- Speed-ups:
- Increase step size of sliding window
- Hierarchical search
Histogram Equalization
Transform the intensity of each pixel of an image so that the cumulative distribution function of the intensities becomes (approximately) linear.
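A minimal sketch of this transform for 8-bit grayscale images, using the common lookup-table formulation based on the cumulative histogram. The function name and the handling of constant images are assumptions.

```python
import numpy as np

def equalize_histogram(image, levels=256):
    """Equalize a 2-D uint8 image so its intensity CDF is roughly linear."""
    hist = np.bincount(image.ravel(), minlength=levels)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]              # CDF value at the darkest occurring level
    denom = image.size - cdf_min
    if denom == 0:                         # constant image: nothing to stretch
        return image.copy()
    lut = np.round((cdf - cdf_min) / denom * (levels - 1))
    lut = np.clip(lut, 0, levels - 1).astype(np.uint8)
    return lut[image]                      # apply the lookup table per pixel

image = np.array([[50, 50], [100, 150]], dtype=np.uint8)
equalized = equalize_histogram(image)
```

The darkest occurring level is mapped to 0 and the brightest to 255, so a low-contrast window uses the full intensity range afterwards.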
Feature-based Face Detection (Viola & Jones)
- Sliding window
- Compute Haar-like features over each window
- use integral image (= summed area table)
- over 180k features/classifiers per sliding window
- Combine classifiers with (something like) AdaBoost
- Train classifier
- Calculate error
- Weight classifier higher, if error is smaller
- Weight training example higher if it was misclassified
- repeat for each classifier
- very fast but still too many features
- => Classifier Cascade
- Multiple layers of classifiers with low false negative rate
- Reject immediately on a negative
- Classify as face if window passes through all layers
- Robust
- Fastest known face detector
- Can be trained for general objects
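The integral image trick mentioned above can be sketched as follows: after one pass over the image, the sum of any rectangle costs four lookups, so a two-rectangle Haar-like feature costs eight. The function names and the specific feature (left half minus right half) are illustrative assumptions.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a leading zero row/column: ii[y, x] = img[:y, :x].sum()."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    """Sum of the h x w rectangle with top-left corner (y, x), in four lookups."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_two_rect_horizontal(ii, y, x, h, w):
    """Left half minus right half of an h x w window (w should be even)."""
    half = w // 2
    return rect_sum(ii, y, x, h, half) - rect_sum(ii, y, x + half, h, half)

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
```

Because the cost per feature is constant regardless of rectangle size, the same table serves all window positions and scales.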
Face Recognition
- Why?
- Accurate (Google FaceNet, 2015: 95-99%, human-level or better performance)
- Can use existing infrastructure (cameras)
- non-intrusive (No physical interaction)
- Only biometric for passive identification in 1-to-many scenarios (find a person in the airport etc.)
- Applications:
- HCI, human-robot-interaction
- Smart Cards (driver’s license, ID, …)
- Surveillance
- Problems:
- Variation due to illumination can be larger than variation due to face identity
- Head pose
- Occlusion
- Types:
- Closed-Set Identification – Test image person is in database
- Open-Set Identification – Person may not be from database
- => False Classify / False Accept / False Reject
Feature-based FR
- Calculate features like “face width at nose position”, “eyebrow thickness at eye center position”
- Classify with k-NN using the Mahalanobis distance d(x, μ) = sqrt((x - μ)^T Σ^-1 (x - μ))
- Σ: covariance matrix; μ: average vector representing the person
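A minimal sketch of classification with the Mahalanobis distance. It assumes one average feature vector per person and a single shared covariance matrix, which reduces k-NN to a nearest-mean rule (k = 1); the labels and numbers are hypothetical.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Distance of x from mean, scaled by the feature covariance."""
    diff = x - mean
    # solve(cov, diff) computes cov^-1 @ diff without forming the inverse.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

def classify_nearest_mean(x, means, cov):
    """Return the label whose average feature vector is closest to x."""
    return min(means, key=lambda label: mahalanobis(x, means[label], cov))

# Hypothetical features: (face width at nose, eyebrow thickness), variances 4 and 1.
cov = np.diag([4.0, 1.0])
means = {"alice": np.array([0.0, 0.0]), "bob": np.array([4.0, 0.0])}
label = classify_nearest_mean(np.array([1.5, 0.5]), means, cov)
```

Features with high variance contribute less to the distance, so a deviation of 2 along the first axis counts the same as a deviation of 1 along the second.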
Appearance-based FR
- holistic: process whole face
- local/fiducial: process facial features (eyes, mouth, …) separately
- align face with facial landmarks
- rescale and crop to face
- => removes translation, rotation and scaling factors
- Calculate principal components (eigenfaces) of training faces
- Keep k eigenfaces that correspond to the highest eigenvalues => face space
- For each person, calculate representation in face space
- Eigenfaces look like Casper the Ghost
- Project new image onto face space
- Find most likely person by distance comparison
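The eigenface steps above can be sketched as follows, assuming faces are already aligned, cropped, and flattened into rows of a matrix. SVD is swapped in here for an explicit eigendecomposition of the covariance matrix (the right singular vectors of the centered data are the covariance eigenvectors); the function names and the Euclidean distance are assumptions.

```python
import numpy as np

def fit_face_space(faces, k):
    """faces: (n_images, n_pixels). Return the mean face and the top-k eigenfaces."""
    mean_face = faces.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(faces - mean_face, full_matrices=False)
    return mean_face, vt[:k]

def project(face, mean_face, eigenfaces):
    """Coefficients of a flattened face in face space."""
    return eigenfaces @ (face - mean_face)

def identify(face, gallery, mean_face, eigenfaces):
    """gallery: dict label -> face-space coefficients. Return the nearest label."""
    coeffs = project(face, mean_face, eigenfaces)
    return min(gallery, key=lambda label: np.linalg.norm(gallery[label] - coeffs))
```

A gallery would be built by calling `project` once per enrolled person; a new image is then identified by projecting it and comparing k coefficients instead of all pixels.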
View-Based Eigenspaces
- Generate a face space for each head orientation
- Decide the input image’s view using the distance-from-view-space metric
- do classification in that view’s face space
Bayesian Face Recognition
- Nearest-neighbor doesn’t exploit knowledge of critical appearance variations
- Bayesian similarity measure is a probabilistic measure that does this
- compares within-class (intrapersonal) variations with between-class (extrapersonal) variations
- Do that with dual PCA (intrapersonal & extrapersonal)
Fisherfaces: like Eigenfaces, but with LDA (maximizes between-class separation)
Modular Eigenspaces
- Classify on fiducial regions (eyes, nose, …) instead of whole face
- Converges faster
Local PCA / Modular PCA
- Divide face into n subimages
- Apply PCA to each subimage
- Does not use class information
- Much of the variation between images is due to illumination changes
- => Fisherfaces
- Does not distinguish between shape and appearance
- => degrades when head orientation must be matched
- => Active Shape Models (ASM)
- => Active Appearance Models (AAM)
FR across pose
Active Appearance Model (AAM)
Fit mesh on face (w/ inverse compositional algorithm)
Shape Model
Appearance Model
- Normalize image to frontal pose (affine transformation of mesh triangles)
- Can compare faces now (e.g. local DCT approaches (discrete cosine transform))
3D Face Model Fitting
FR using 3D Models
- Database of 3D Scans
- Label 7 fiducial points manually
- Fit 3D Model (shape and texture vectors) to input image
Local Feature-based FR
- Local analysis
- fuse outputs of local features at decision or feature level
Elastic Bunch Graphs
- 40 complex Gabor wavelet coefficients (a Jet) per pixel
- graph of n jets
- bunch graph = different jets for different poses/appearances (“closed eye”, “open eye”, …)
Gabor Filters
- 2D sine wave under a Gaussian envelope; similar to the response of cells in the visual cortex
- Many scales, orientations
- => PCA
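A minimal sketch of a Gabor filter bank with several scales and orientations. Only the real (cosine) part is generated; the default of 5 wavelengths x 8 orientations is an assumption chosen so that the bank has 40 filters, and the parameter names are illustrative.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma, gamma=1.0, psi=0.0):
    """Real part of a Gabor filter: cosine wave under a Gaussian envelope.

    size must be odd; theta is the orientation in radians.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)     # rotate coordinates
    y_rot = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_rot ** 2 + (gamma * y_rot) ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * x_rot / wavelength + psi)
    return envelope * carrier

def gabor_bank(size=21, wavelengths=(4, 6, 8, 12, 16), n_orientations=8):
    """List of kernels, one per (wavelength, orientation) pair."""
    return [gabor_kernel(size, lam, np.pi * i / n_orientations, sigma=0.5 * lam)
            for lam in wavelengths
            for i in range(n_orientations)]
```

Convolving an image with every kernel in the bank yields one response per filter at each pixel, which is why a reduction step such as PCA usually follows.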
Local Binary Pattern Histogram
- Divide image into cells
- Threshold the neighbors of each pixel against the center pixel value to get a binary code; build a histogram of the codes per cell
- Better than simple gradients
- Can effectively remove outliers
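A minimal sketch of the basic 8-neighbor LBP operator and per-cell histograms. It assumes the common formulation where each neighbor is compared against the center pixel (bit set if neighbor >= center); the bit ordering and the 2x2 cell grid are arbitrary choices for illustration.

```python
import numpy as np

# Neighbor offsets (dy, dx), clockwise from the top-left neighbor.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_codes(gray):
    """8-bit LBP code for every interior pixel of a 2-D array."""
    center = gray[1:-1, 1:-1]
    h, w = center.shape
    codes = np.zeros((h, w), dtype=np.uint8)
    for bit, (dy, dx) in enumerate(OFFSETS):
        neighbor = gray[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
        codes |= (neighbor >= center).astype(np.uint8) * np.uint8(1 << bit)
    return codes

def lbp_histogram(gray, grid=(2, 2)):
    """Concatenate 256-bin code histograms over a grid of cells."""
    codes = lbp_codes(gray)
    hists = []
    for rows in np.array_split(codes, grid[0], axis=0):
        for cell in np.array_split(rows, grid[1], axis=1):
            hists.append(np.bincount(cell.ravel(), minlength=256))
    return np.concatenate(hists)
```

Because only the ordering of intensities matters, the codes are unchanged by any monotonic change in brightness.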
Dense Features
- Compute features for lots of overlapping patches in many scales
- Millions of features
- => Encode into compact form
- Bag of Visual Word Model
- Fisher encoding
- Captures the average first and second order differences between features and the centres of a GMM
- stack difference vectors
- => 3.2m features -> 130k features
- Subspace learning for further compression
- => 130k features -> 1000 features
- State of the art performance on faces in the wild
Deep Learning
Q: Why can’t the mapping between layers be linear?
A: A composition of linear functions is itself linear, so the whole network collapses to linear regression.
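The collapse can be checked numerically. In this small sketch, two stacked linear (affine) layers with random weights are compared against the single layer obtained by multiplying the weight matrices; the layer sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # layer 1: 3 -> 4
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # layer 2: 4 -> 2

def two_linear_layers(x):
    """Two layers without a nonlinearity in between."""
    return W2 @ (W1 @ x + b1) + b2

# The same mapping as ONE layer: W = W2 W1, b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2

def single_linear_layer(x):
    return W @ x + b

x = rng.normal(size=3)
same = np.allclose(two_linear_layers(x), single_linear_layer(x))
```

Inserting any nonlinearity (e.g. a sigmoid or ReLU) between the layers breaks this equivalence, which is what gives depth its extra expressive power.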
Q: What does a hidden unit do?
A: Can be thought of as a classifier or feature detector.
Q: How many layers? How many hidden units?
A: These are hyper-parameters, best chosen using cross-validation. In general, wider and deeper networks can represent more complex function mappings.
Q: Why do we need many layers?
A: Data with hierarchical structure is well exploited with a hierarchical model architecture where intermediate features can be re-used.
Soft Max
Squash a vector of reals into the range [0, 1] so that the entries add up to 1.
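A minimal sketch of softmax in plain Python. Subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import math

def softmax(values):
    """Map a list of reals to positive numbers that sum to 1."""
    shift = max(values)                        # avoids overflow in exp for large inputs
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```

Larger inputs get larger probabilities, and the ordering of the inputs is preserved.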
Facial Expression Recognition
I got kind of pressed for time here and didn’t summarize the rest:
Head Pose Estimation
People Detection
Gesture Recognition
Action Recognition
- Laptev’s action features