EECS 442 Computer Vision: (Optional) Homework 6
Instructions
• The submission includes two parts:
1. To Canvas: submit a zip file of all of your code.
We have indicated questions where you have to do something in code in red.
We have indicated questions where we will definitely use an autograder in purple.
Please be especially careful on the autograded assignments to follow the instructions. Don’t
swap the order of arguments and do not return extra values.
If we’re talking about autograding a filename, we will be pulling out these files with a script.
Please be careful about the name.
Your zip file should contain a single directory which has the same name as your uniqname. If I
(David, uniqname fouhey) were submitting my code, the zip file should contain a single folder
fouhey/ containing all required files.
What should I submit? At the end of the homework, there is a canvas submission checklist
provided. We provide a script that validates the submission format here. If we don’t ask you for
it, you don’t need to submit it; while you should clean up the directory, don’t panic about having
an extra file or two.
2. To Gradescope: submit a pdf file as your write-up, including your answers to all the questions
and key choices you made.
We have indicated questions where you have to do something in the report in blue.
You might like to combine several files to make a submission. Here is an example online link
for combining multiple PDF files: https://combinepdf.com/.
The write-up must be an electronic version. No handwriting, including plotting questions.
LaTeX is recommended but not mandatory.
Python Environment
We are using Python 3.7 for this course. You can find references for the Python standard library here:
https://docs.python.org/3.7/library/index.html. To make your life easier, we recommend installing the
latest Anaconda for Python 3.7 (https://www.anaconda.com/download/). This is a Python package manager
that includes most of the modules you need for this course.
We will make use of the following packages extensively in this course:
• Numpy (https://docs.scipy.org/doc/numpy-dev/user/quickstart.html).
• Matplotlib (http://matplotlib.org/users/pyplot_tutorial.html).
• OpenCV (https://opencv.org/).
1 Foreword
1.1 The Setup and Math You May Want
This setup is best read alongside the lecture slides.
You already know some of this stuff, but often it takes seeing something a few times to get it.
Preemptively, I am being a bit hand-wavy in parts and I am dancing around using cleaner explanations that
depend on more advanced linear algebra. My apologies to the math department.
1.1.1 Cross products
Despite the fact that we effectively live in a 3D space, linear algebra classes often do not teach the cross product ×. The cross product of two vectors a and b, denoted a × b, just makes a third vector c that points in a direction that is orthogonal to a and b (i.e., c^T a = 0 and c^T b = 0), and has norm ||c||_2 equal to the area of the parallelogram with a and b as sides. The orientation of c is determined by the right-hand rule.
Rather than call np.cross or something, you can also compute the cross product with a fixed vector x = [x_1, x_2, x_3] by the following form involving a skew-symmetric matrix [x]_×:

x × y = [x]_× y = \begin{bmatrix} 0 & -x_3 & x_2 \\ x_3 & 0 & -x_1 \\ -x_2 & x_1 & 0 \end{bmatrix} y    (1)
Once you have the cross product, you can also produce the triple product z^T (x × y), or the dot product of z with a vector that is perpendicular to x and y. This somewhat cryptic expression is useful in computer
vision since it is a good test for co-planarity. Namely, if the triple product is zero, then z is co-planar with x
and y (it has zero projection onto the normal for the plane defined by the origin, x, and y).
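As a quick numerical sanity check (a minimal numpy sketch with made-up vectors), the skew-symmetric form reproduces np.cross, and the triple product vanishes for a third vector that is co-planar by construction:

import numpy as np

def skew(x):
    # The 3x3 skew-symmetric matrix [x]_x, so that skew(x) @ y == np.cross(x, y).
    return np.array([[0, -x[2], x[1]],
                     [x[2], 0, -x[0]],
                     [-x[1], x[0], 0]])

x = np.array([1.0, 2.0, 3.0])
y = np.array([-1.0, 0.5, 2.0])
print(np.allclose(skew(x) @ y, np.cross(x, y)))   # True

z = 2 * x - 3 * y                                  # co-planar with x and y by construction
print(np.isclose(z @ np.cross(x, y), 0))           # True: the triple product is zero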
1.1.2 Homogeneous Coordinates
At this point, you’re familiar with homogeneous coordinates, but we’ll just recap them.
Homogeneous coordinates are the 2D coordinates that you know and love with an extra dimension. If you
have a 2D point [u, v], you can convert it to a homogeneous coordinate by concatenating 1 and making
[u, v, 1]. If you have a homogeneous coordinate [a, b, c] you can convert it to an ordinary 2D point by
dividing by the last coordinate c, or [a/c, b/c]. If c = 0, this doesn’t work; a point with c = 0 is at “the line
at infinity” (where parallel lines converge).
Intuition: When we have a point on the imaging plane, we’ve made it a homogeneous coordinate by concatenating a 1. You can think of this point as really representing a line from the origin to the point on the
image plane. We often first define the point to be on a plane with depth 1 although all vectors that face the
same direction are equivalent (except when the depth is 0).
Checking for equivalence: Two homogeneous coordinates p and p′ represent the same point in the image if they are proportional to each other, or there is some λ ≠ 0 such that p = λp′. Testing if there exists such a λ is tricky. Instead, it's useful to generate constraints in computer vision by testing if p × p′ = 0 (i.e., the cross product is the zero vector). Note the equals! This works because the magnitude of the cross product is determined by the area of a parallelogram with p and p′ as its sides. If they point in the same direction, this area is zero. It's ok if you didn't see this in your Linear Algebra class.
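A tiny numpy sketch of this (made-up point): proportional homogeneous coordinates convert to the same 2D point, and the cross-product test detects the equivalence.

import numpy as np

p = np.array([2.0, 4.0, 2.0])                    # homogeneous version of the 2D point [1, 2]
p_prime = 3 * p                                  # proportional, so the same image point
print(p[:2] / p[2], p_prime[:2] / p_prime[2])    # both [1. 2.]
print(np.allclose(np.cross(p, p_prime), 0))      # True: equivalent points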
1.1.3 Homogeneous Coordinates and Lines
One of the nice things about homogeneous coordinates is that they make points and lines the same size. If I have 2D points in homogeneous coordinates x ∈ R^3 and y ∈ R^3 and a line l = [a, b, c]^T ∈ R^3, then x = [u, v, 1] lies on the line l (i.e., au + bv + c = 0) if and only if l^T x = x^T l = 0. But critically, you can interpret any
3D vector as a point or a line. This leads to lots of genuinely wonderful results, but also makes dealing with
points and lines a lot easier than most of what you’re inclined to do based on high school math.
Intuition for Lines: You can think of a line on the image plane as really being a plane that passes through
the origin as well as the image plane. The line coordinates are actually the normal of that plane. The usual
3D plane equation involving a normal n and offset o, namely n^T [x, y, z] + o = 0, is simplified by the fact that the plane passes through the origin, so o = 0.
Intersection of two lines: given two lines l1 and l2, their intersection p is given by l1 × l2. This can be seen algebraically by the fact that p = l1 × l2 has to be orthogonal to both (i.e., p^T l1 = 0 and p^T l2 = 0). So
naturally, p has to satisfy both lines’ equations. Geometrically, you could think of this as constructing a
direction that is perpendicular to both planes’ normals: this has to lie on both the planes, and therefore is the
intersection of the planes.
Line through two points: given two points x and y, the line through them is given by x × y. Algebraically,
this follows for the same reason why the intersection of two lines is the cross product. Geometrically, you
can think of this as constructing a plane that is perpendicular to both directions. Draw a line from the origin
to both points. The origin and both points form a plane whose normal is proportional to x × y.
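For example (a small numpy sketch with made-up lines and points):

import numpy as np

l1 = np.array([1.0, 0.0, -2.0])   # the line x = 2
l2 = np.array([0.0, 1.0, -3.0])   # the line y = 3
p = np.cross(l1, l2)              # their intersection, as a homogeneous point
print(p / p[2])                   # [2. 3. 1.]

x = np.array([0.0, 0.0, 1.0])     # two homogeneous points
y = np.array([1.0, 1.0, 1.0])
l = np.cross(x, y)                # the line through them
print(l @ x, l @ y)               # both 0: each point lies on the line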
1.1.4 Camera Projection
Suppose you have a 2D point p = [u, v, 1] and a 3D point X = [x, y, z, 1] (capitalized to be consistent with the slides, where it's capitalized to make it more obvious it's 3D). The camera that projects points in the world to the image is composed of three key components: an intrinsics matrix K ∈ R^{3×3}, a rotation R ∈ R^{3×3}, and a translation t ∈ R^3. The rotation and translation join together to form the extrinsics matrix [R, t], which is the horizontal concatenation of the two matrices (i.e., a member of R^{3×4}). The intrinsics matrix K is intrinsic because it doesn't change with the location of the camera. The extrinsics matrix is the matrix of the camera's properties with respect to the external world.
The matrices aren’t all freely choosable. The only one that is freely choosable is the translation, which can
be picked arbitrarily. Since R is a rotation, it only has 3 degrees of freedom despite having 9 coordinates.
In its most general form, K might just be an upper triangular matrix. But usually K is assumed to be of the
form:
K = \begin{bmatrix} f & 0 & c_u \\ 0 & f & c_v \\ 0 & 0 & 1 \end{bmatrix}.    (2)
The particular 3-parameter form of K assumes that pixels are square and there is no skew (i.e., the axes
of the image are perpendicular). Both seem like obvious requirements. While they aren’t always met in
practice, they usually hold (or rather hardware manufacturers spend a lot of energy making sure they hold).
Personally, I've only encountered one camera that has non-square pixels – a solar telescope (Hinode/SOT-SP) that creates images by sweeping a line of pixels across the Sun. However, I am told that it's often useful
to have parameters that absorb modeling errors (i.e., even if the pixels are square, pretending they might be
slightly wrong might handle other unmodeled issues).
In total, projection should satisfy p ≡ K[R, t]X. This can be grouped in a number of equivalent ways
to get some insight into what’s going on. If you multiply all the camera matrices, you can get a 3 × 4
matrix M = K[R, t]. Then p ≡ MX. Alternatively, you can think of this as p ≡ K([R, t]X), or X gets
transformed by R and t, and then projected by K.
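A minimal numpy sketch of this projection (K, R, t, and X are made-up values):

import numpy as np

f, cu, cv = 500.0, 320.0, 240.0
K = np.array([[f, 0, cu],
              [0, f, cv],
              [0, 0, 1.0]])
R = np.eye(3)                          # made-up extrinsics: identity rotation
t = np.array([0.0, 0.0, 1.0])          # and a shift along z
M = K @ np.hstack([R, t[:, None]])     # the 3x4 matrix M = K[R, t]

X = np.array([0.1, -0.2, 2.0, 1.0])    # a 3D point in homogeneous coordinates
p = M @ X                              # homogeneous pixel coordinates
print(p[:2] / p[2])                    # divide by the last entry to get [u, v]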
1.1.5 Homogeneous Least-Squares Redux
Often we set up fitting problems where we would like to find p parameters of a model x (i.e., x ∈ R
p
) and
we can create an equation a such that if a
T x = 0, then the model fits well. In other words, we can generate
an equation (whose terms are encoded in a) that is linear in x that is satisfied if x is a good model. If we get
N of these equations, we stack all of these together to make a matrix A with N rows. This gives a system
of equations:
Ax =
− a1 −
.
.
.
− aN −
x =
0
.
.
.
0
. (3)
Two things make this a little tricky. First, you can always trivially satisfy this by setting x = 0. Second, if you have p < N, then things are overconstrained, so no non-zero vector will actually satisfy all of these equations. These difficulties are overcome by instead finding the unit vector x ∈ R^p that minimizes ||Ax||_2^2. The unit vector x* that minimizes ||Ax||_2^2 ends up being the eigenvector of (A^T A) with the smallest eigenvalue.
Why Does This Work? If we expand out ||Ax||_2^2, we get (Ax)^T (Ax) and then x^T (A^T A) x. Our requirement that ||x||_2 = 1 just comes from the fact that the value of ||Ax||_2^2 can be arbitrarily changed by scaling x. So the thing we'd really like to optimize, which accounts for this scale, is:

\frac{x^T (A^T A) x}{x^T x} = \frac{x^T (A^T A) x}{||x||_2^2} = \frac{x^T (A^T A) x}{||x||_2 \cdot ||x||_2} = \left(\frac{x}{||x||_2}\right)^T (A^T A) \left(\frac{x}{||x||_2}\right).    (4)
From here, check out the Rayleigh Quotient. This is a useful property of eigenvectors. If you have a maximization problem, the vector that maximizes the quotient is – you guessed it – the eigenvector corresponding to the maximum eigenvalue; for our minimization problem, it is the eigenvector corresponding to the minimum eigenvalue.
Why Doesn't This Work? While x really should satisfy Ax = 0 if things are perfect, the value ||Ax||_2^2 = \sum_{i=1}^{N} (a_i^T x)^2 often doesn't have any real meaning. So while the minimum of ||Ax||_2^2 is probably not a bad idea, it's not clear that making that value small corresponds to making the estimates of geometry better. So in practice, it's common to use homogeneous least-squares to get a good first guess that you polish off by using a non-linear optimizer on some other cost function (typically expressed as a distance between where points should be in 2D and where they are).
TL;DR: The too-long version is: we give you a design matrix A such that a model x that fits perfectly should satisfy Ax = 0. You then compute the eigenvector corresponding to the smallest eigenvalue of A^T A. Alternatively, you can also take the last right-singular vector of an SVD UΣV^T = A of A (the columns of V are eigenvectors of A^T A).
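A minimal sketch of this recipe in numpy (assuming a design matrix A has already been built; here it is random just to show the mechanics):

import numpy as np

def homogeneous_lstsq(A):
    # Unit vector x minimizing ||Ax||_2^2: the last right-singular vector of A,
    # i.e., the eigenvector of A^T A with the smallest eigenvalue.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]

A = np.random.randn(20, 9)                  # stand-in for a real design matrix
x = homogeneous_lstsq(A)
print(np.isclose(np.linalg.norm(x), 1.0))   # unit norm
print(np.linalg.norm(A @ x))                # the (small) residual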
1.1.6 The Closest Rank-k Matrix
Given a matrix M ∈ R^{n×n}, we often want to find some matrix M̂_k that is as "close" to M as possible, but which has a lower rank (e.g., some k with k < n).
Rank Reminder: Recall that the rank of a matrix is the dimension of the space spanned by its vectors. Matrices of size n × n that are rank deficient do not span the full space of R^n. Equivalently, they do not have an inverse and have a non-trivial nullspace.
We can solve for the closest lower-rank matrix with an SVD. Suppose UΣV^T = M is the singular value decomposition (SVD) of M. The matrices U, V^T are just rotation matrices, and Σ should be diag([σ_1, ..., σ_n]) with σ_i ≥ σ_{i+1}. In other words, Σ is a diagonal matrix with the singular values σ_i along the diagonal in descending order. Let's make Σ_k = diag([σ_1, ..., σ_k, 0, ..., 0]) by replacing σ_{k+1} and later singular values with zeros. Then, we can make a matrix M̂_k = UΣ_kV^T, which consists of the original SVD of M, but with all but the first k singular values set to 0.
One nice result is the Eckart–Young–Mirsky theorem, which tells us that this is not a ridiculous idea. Specifically, M̂_k is actually the closest matrix of rank k to M! In other words,

M̂_k = \arg\min_{M' \in R^{n×n},\ \mathrm{rank}(M') = k} ||M' − M||.    (5)
I’m leaving the norm unspecified, because this holds for both (a) the Frobenius norm || · ||F , which is a
fancy name for taking the square root of the sum of the squares of all the entries; and (b) the spectral norm
|| · ||2, which is the largest singular value (intuitively a measure of the maximum length a unit norm vector
can have after being multiplied by the matrix).
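A minimal numpy sketch of this truncation (the matrix is made up):

import numpy as np

def closest_rank_k(M, k):
    # Closest rank-k matrix to M (in Frobenius or spectral norm) via the SVD:
    # zero out sigma_{k+1}, ..., sigma_n and re-multiply.
    U, S, Vt = np.linalg.svd(M)
    S_k = S.copy()
    S_k[k:] = 0
    return U @ np.diag(S_k) @ Vt

M = np.random.randn(4, 4)
M_hat = closest_rank_k(M, 2)
print(np.linalg.matrix_rank(M_hat))    # 2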
1.1.7 Finding the Null Space
Given an n × n matrix A with reduced rank, you can find a basis for the nullspace via the SVD. In a particularly useful instance, suppose A has rank n − 1 and UΣV^T is the singular value decomposition of A. Then the left nullspace of A is given by the last column of U and the right nullspace by the last column of V.
You have to be careful with how linear algebra packages return results. Just as you have to be careful with whether the eigenvectors are rows or columns, you should be careful here too. If you think you have a solution x for the right nullspace, try computing Ax; for the left nullspace, try computing x^T A. These should be close to the zero vector.
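A small numpy sketch of this check (the matrix is made up; its third row is the sum of the first two, so it has rank 2):

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 4.0],
              [1.0, 3.0, 7.0]])

U, S, Vt = np.linalg.svd(A)
right_null = Vt[-1]       # last row of V^T, i.e., the last column of V
left_null = U[:, -1]      # last column of U
print(A @ right_null)     # ~ [0, 0, 0]
print(left_null @ A)      # ~ [0, 0, 0]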
1.2 Linear Camera Calibration
If I have a set of N 2D points p_i ≡ [u_i, v_i, 1] and 3D points X_i, I know that p_i ≡ M X_i should hold. This also means that p_i × M X_i = 0. You can use this to derive that:

\begin{bmatrix} 0^T & X_i^T & -v_i X_i^T \\ X_i^T & 0^T & -u_i X_i^T \\ -v_i X_i^T & u_i X_i^T & 0^T \end{bmatrix} \begin{bmatrix} M_1^T \\ M_2^T \\ M_3^T \end{bmatrix} = 0,    (6)
where M1, M2, and M3 are the three rows of M. If you re-derive this yourself, you may have to flip a sign.
There are actually only two linearly independent equations in this matrix: you can see this by multiplying
the top row by ui and the middle row by −vi and then adding the two of them to the third row.
Given that you have a matrix of the form Ax = 0, you're set to use homogeneous least-squares! It's probably best to just grab the first two rows of the resulting matrix per 2D/3D correspondence.
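Here is a rough numpy sketch of that recipe (the function name and signature are illustrative, not the required find_projection interface):

import numpy as np

def calibrate_camera(pts2d, pts3d):
    # pts2d: Nx2 pixel locations, pts3d: Nx3 world points.
    # Builds the first two rows of Equation (6) per correspondence and solves
    # the homogeneous least-squares problem for the 3x4 matrix M (up to scale).
    rows = []
    for (u, v), Xw in zip(pts2d, pts3d):
        X = np.append(Xw, 1.0)                     # homogeneous 3D point
        zero = np.zeros(4)
        rows.append(np.hstack([zero, X, -v * X]))  # row 1 of Equation (6)
        rows.append(np.hstack([X, zero, -u * X]))  # row 2 of Equation (6)
    A = np.array(rows)
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)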
1.3 The Essential Matrix
Suppose we have two views of a scene. If we have access to the intrinsics of our cameras, the Essential
Matrix represents the relationship between the two views.
Setup: Suppose K, K′ are the intrinsics of images 1 and 2, and the mapping from camera 1's coordinate system to camera 2's is given by a rotation R ∈ R^{3×3} and translation t ∈ R^3. This means that the projection matrix for image 1 is K[I, 0] and the projection matrix for image 2 is K′[R, t]. We'll then assume we have a point in 3D X that projects to pixel location p in image 1 and pixel location p′ in image 2, or p ≡ K[I, 0]X and p′ ≡ K′[R, t]X.
Normalized Coordinates: Note that p, p′ are defined in image coordinates. Since they are 2D objects in the image plane, you cannot combine them with 3D objects in the world. However, if we undo the projection, or compute p̂ = K^{-1}p and p̂′ = K′^{-1}p′, we now have the rays in 3D that go through the origins of cameras 1 and 2. Mechanically, you can now see that p̂ ≡ [I, 0]X and p̂′ ≡ [R, t]X. These geometrically meaningful representations of the coordinates are called normalized coordinates.
One annoying detail is that p̂ is defined in camera 1's coordinate system and p̂′ is defined in camera 2's coordinate system. That said, we have the rotation and translation that transform between them. When working through this, be careful of what coordinate system each vector is in!
Deriving E: We'll start off with the rays p̂ = K^{-1}p and similarly p̂′ = K′^{-1}p′. We can convert the ray p̂ from camera 1 to camera 2's coordinate system by the matrix R. Then, the following vectors must all be co-planar: Rp̂, p̂′, and t. This is satisfied when their triple product p̂′^T (t × Rp̂) = 0. One can expand this out to form

p̂′^T (t × Rp̂) = 0 → p̂′^T [t]_× R p̂ = 0.    (7)
The Essential Matrix E is [t]_× R. Then, given any points p̂, p̂′ in normalized coordinates, they must always satisfy p̂′^T E p̂ = 0.
Writing E out explicitly: To point to how the Fundamental Matrix works, let's just write the full expression out from pixels on one side to pixels on the other. First, let's do it in a somewhat reasonably grouped setting:

(K′^{-1}p′)^T E (K^{-1}p) = 0,    (8)

which we can re-arrange a bit to make the nicely symmetric form

p′^T K′^{-T} E K^{-1} p = 0.    (9)
Decomposing E: Given R and t, there's only one E (up to scale). On the other hand, given an E (for instance, fit from correspondences), there are multiple R and t that could have produced it. However, since E = [t]_× R with R a rotation matrix and [t]_× skew-symmetric, there are only a few options/well-behaved ambiguities. Specifically, one has to choose between one of two rotations and a sign for the translation. Finally, there's a fundamental scale ambiguity for t.
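To make this concrete, here is a small numpy sketch (made-up pose and point) that builds E = [t]_× R and checks the constraint in normalized coordinates:

import numpy as np

def skew(t):
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

theta = 0.1                                       # made-up relative pose
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([1.0, 0.0, 0.2])
E = skew(t) @ R

X = np.array([0.3, -0.1, 4.0])    # a 3D point in camera 1's frame
p_hat = X / X[2]                  # normalized coordinates in camera 1
X2 = R @ X + t                    # the same point in camera 2's frame
p_hat_prime = X2 / X2[2]          # normalized coordinates in camera 2
print(p_hat_prime @ E @ p_hat)    # ~ 0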
1.4 The Fundamental Matrix
If we don’t have information about the intrinsics of the camera, the Fundamental Matrix represents the relationship between the two cameras. What’s exciting is that the Fundamental Matrix exists for any two images
capturing the same scene and it constrains how the points could be related without ever reasoning about the
3D scene explicitly! If you can estimate the Fundamental Matrix from a few strong correspondences, then
you can constrain all the other matches.
Specifically, suppose a point p ≡ [u, v, 1]^T in image 1 and a point p′ ≡ [u′, v′, 1]^T in image 2 are both the result of projecting some 3D point X by projection matrices M_1 and M_2. In other words, p ≡ M_1 X and p′ ≡ M_2 X. Then the Fundamental Matrix F between the two cameras satisfies the relationship p′^T F p = 0,
or expanding out a few times,

p′^T F p = 0 → \begin{bmatrix} u′ & v′ & 1 \end{bmatrix} F \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = 0 → \begin{bmatrix} u′ & v′ & 1 \end{bmatrix} \begin{bmatrix} f_{11} & f_{12} & f_{13} \\ f_{21} & f_{22} & f_{23} \\ f_{31} & f_{32} & f_{33} \end{bmatrix} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = 0.    (10)
You can multiply Fp explicitly to find an expression as the dot product between two vectors, or

[u′, v′, 1] \begin{bmatrix} u f_{11} + v f_{12} + f_{13} \\ u f_{21} + v f_{22} + f_{23} \\ u f_{31} + v f_{32} + f_{33} \end{bmatrix} = 0,    (11)
and then multiplying further you get

u′u f_{11} + u′v f_{12} + u′ f_{13} + v′u f_{21} + v′v f_{22} + v′ f_{23} + u f_{31} + v f_{32} + f_{33} = 0.    (12)
You can pull out all of the entries in F to yield

[u′u, u′v, u′, v′u, v′v, v′, u, v, 1][f_{11}, f_{12}, f_{13}, f_{21}, f_{22}, f_{23}, f_{31}, f_{32}, f_{33}]^T = 0.    (13)
This expression is not particularly enlightening, but is useful for fitting the matrix.
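As a small aside (a numpy sketch), the nine-dimensional row in Equation (13) is exactly the Kronecker product of the two homogeneous points, which is a convenient way to build it:

import numpy as np

p = np.array([10.0, 20.0, 1.0])        # made-up point in image 1
p_prime = np.array([30.0, 40.0, 1.0])  # made-up point in image 2
row = np.kron(p_prime, p)              # [u'u, u'v, u', v'u, v'v, v', u, v, 1]
print(row)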
Fundamental Matrix Facts: F has a number of useful properties. Again, suppose a point p ≡ [u, v, 1]^T in image 1 and a point p′ ≡ [u′, v′, 1]^T in image 2 are both the result of projecting some 3D point X by projection matrices M_1 and M_2. Note, however, we do not need M_1 and M_2 to reason via F.
1. Relationship to Essential Matrix: Given an Essential Matrix E and camera intrinsics K and K′, the following relationship holds:

F = K′^{-T} E K^{-1}.    (14)

One can think of this equation as operationally describing how the Fundamental Matrix works: it is the Essential Matrix wrapped in intrinsic matrices that convert the pixel locations (i.e., in grid coordinates) into honest-to-goodness 3D rays (i.e., x/y/z in the real world).
If we break down everything, we get the involved expression explaining how pixel locations p and p′ must be related:

p′^T K′^{-T} [t]_× R K^{-1} p = 0.    (15)
This expression suggests that you won’t be able to uniquely decompose F into something clean: there
are a lot of matrices in this expression so changes in one can compensate for changes in another.
2. Epipolar Lines: If we have a point p in image 1, we don't actually know where in 3D that point came from. We just know that it corresponds to a ray going out into the world. This ray, when projected back to image 2, results in a line – recall that projection preserves lines – called an epipolar line. Note that without the intrinsics, we do not know where this ray is; we only know that it exists.
Now, let's look at generating that epipolar line from the Fundamental Matrix F. You can break up p′^T F p = 0 in two ways: p′^T (Fp) = 0 and (p′^T F) p = 0.
Let's look at the first one. Given that p′ is a point, (Fp) can be thought of as the coefficients of an equation of a line: p′^T (Fp) = 0 is a way of expressing that the point p′ is on the line Fp. Thus, the epipolar line for p (i.e., the line that p′ must lie on) is given by Fp.
The second way of writing things, (p′^T F) p = 0, can be cleaned up a bit by transposing, yielding p^T (F^T p′) = 0. This can then be seen as taking the point p and checking if it lies on the line (F^T p′). (A small numpy sketch of computing epipolar lines and epipoles appears after this list.)
3. Epipoles: The epipoles are the projection of the baseline between the two cameras into the images, and accordingly the projection of each camera's origin into the other camera's image. Suppose we know that we have a point p in image 1. Geometrically speaking, this point p could result from X being at the origin of camera 1. Thus, the epipolar line in image 2 must always contain the epipole.
Thus, the epipoles are also where all the epipolar lines intersect. This somewhat innocuous statement implies something quite interesting about the nullspaces of F. Specifically, suppose e is the epipole in image 1. Then, for every single point p′, if I construct its epipolar line p′^T F (i.e., the equation of the line), e lies on this line every single time. This means that p′^T F e = 0 irrespective of how I pick p′. The only way this is true is if Fe = 0.
Thus, the epipole e in image 1 is the right-nullspace of F and the epipole in image 2 (by similar logic) is the left-nullspace of F. Note that if you compute the nullspace, you will usually get a unit vector from the software you're using. To get the real coordinates of the epipole, you'll have to scale things so the last entry is one.
4. Rank of F: Given that F has non-trivial nullspaces (namely the epipoles), F is rank deficient. The deduction that F is rank deficient comes from those usual long lists of equivalent statements in linear algebra about matrices: full rank matrices are invertible, only have trivial nullspaces (i.e., 0), etc. etc. In particular, F has rank 2, since the nullspace is 1D (all the scaled copies of e for the right and e′ for the left).
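Tying items 2 and 3 together, a small numpy sketch (the rank-2 matrix here is a random stand-in for a real fundamental matrix; it also assumes the epipoles are not at infinity, so the last entries can be scaled to one):

import numpy as np

F = np.random.randn(3, 2) @ np.random.randn(2, 3)   # random rank-2 stand-in for F
p = np.array([100.0, 50.0, 1.0])                    # a homogeneous point in image 1
l = F @ p                                           # its epipolar line [a, b, c] in image 2;
                                                    # a correspondence p' should satisfy l @ p' ~ 0

U, S, Vt = np.linalg.svd(F)
e = Vt[-1]                                          # right nullspace: F e ~ 0 (epipole in image 1)
e_prime = U[:, -1]                                  # left nullspace: e'^T F ~ 0 (epipole in image 2)
e, e_prime = e / e[2], e_prime / e_prime[2]         # rescale so the last entry is one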
1.5 Solving for F
If we have pairs of points in correspondence [u_i, v_i] ↔ [u′_i, v′_i], then we can make one row of a matrix per correspondence, taking the form

U = \begin{bmatrix} & & & & \vdots & & & & \\ u_i u′_i & u_i v′_i & u_i & v_i u′_i & v_i v′_i & v_i & u′_i & v′_i & 1 \\ & & & & \vdots & & & & \end{bmatrix}.    (16)
This can be solved via standard homogeneous least-squares: if f contains all the parameters of F, then Uf
should be 0. We thus solve for f by finding the eigenvector with the smallest eigenvalue of U^T U.
Complication 1 – Rank Deficiency: The Fundamental Matrix is not full rank and the optimization will possibly give us a full rank matrix F. So you can compute the closest low-rank matrix using the SVD. The thing that goes wrong with a full rank "Fundamentalish matrix" F′ is that, because it is full rank, it has no non-trivial nullspaces. Therefore, the epipoles don't exist and so the epipolar lines won't converge!
Complication 2 – Numerical Stability: This algorithm is a terrible idea numerically on raw pixel locations. So in practice, you need to rescale the data. The "In defense of the eight-point algorithm" paper has a somewhat involved solution. You can also just make a homography T that scales things so that the center is zero and the image is one unit wide. It's a good idea to: (1) pre-compute the transformation as a matrix T, (2) apply it to the data, (3) fit the fundamental matrix F on the transformed data, and (4) finally return T^T F T. One can think of this form as just applying the transformation to both input points.
The numerical instability comes from the fact that the matrix U contains entries like u′u, u, and 1. If the image is of size 1,000 × 1,000, then the pixel coordinates u and u′ are of size ∼10^3. Then u′u ∼ 10^6, u ∼ 10^3, and 1 ∼ 10^0. Things are made worse if one tries to compute U^T U: the range is then 10^12, ..., 10^0. By rescaling things, the relative gap is much smaller.
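Putting the pieces together, here is a rough sketch of the normalized fit described above (a sketch only: the helper name and the simple width/height scaling are illustrative choices, and the rows follow the ordering of Equation (13)):

import numpy as np

def fit_fundamental(pts1, pts2, width, height):
    # pts1, pts2: Nx2 arrays of corresponding pixel locations (N >= 8).
    # (1) Transformation that centers the image at 0 and makes it ~1 unit wide.
    T = np.array([[1.0 / width, 0, -0.5],
                  [0, 1.0 / height, -0.5],
                  [0, 0, 1.0]])
    ones = np.ones((pts1.shape[0], 1))
    q1 = np.hstack([pts1, ones]) @ T.T        # (2) apply T to both point sets
    q2 = np.hstack([pts2, ones]) @ T.T

    # (3) Fit F on the transformed data by homogeneous least-squares.
    U_rows = np.array([np.kron(b, a) for a, b in zip(q1, q2)])
    _, _, Vt = np.linalg.svd(U_rows)
    F = Vt[-1].reshape(3, 3)

    # Enforce rank 2 by zeroing the smallest singular value.
    Uf, Sf, Vft = np.linalg.svd(F)
    Sf[-1] = 0
    F = Uf @ np.diag(Sf) @ Vft

    # (4) Undo the normalization and return.
    return T.T @ F @ T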
2 Camera Calibration
Figure 1: Epipolar lines for some of the datasets (temple, zrtrans, reallyInwards). The epipoles are marked in little white circles. Best seen by zooming in.
Task 1: Estimating M [20 points]
We will give you a set of 3D points {X_i}_i and corresponding 2D points {p_i}_i. The goal is to compute the projection matrix M that maps from world 3D coordinates to 2D image coordinates. Recall that

p ≡ MX,    (17)

and (see the foreword) by deriving an optimization problem, you can solve for M. The script task1.py shows you how to load the data. The data we want you to use is in task1/, but we show you how to use the data from Tasks 2 and 3 as well. Credit: The data from task 1 and an early version of the problem comes from James Hays's Georgia Tech CS 6476.
(a) (5 points) Fill in find_projection in task1.py.
(b) (5 points) Report M for the data in task1/.
(c) (5 points) Describe how well the projection maps the points X_i to p_i. Specifically, compute the average distance in the image plane (i.e., pixel locations) between the homogeneous points MX_i and the 2D image coordinates p_i, or

\frac{1}{N} \sum_{i}^{N} ||\mathrm{proj}(M, X_i) − p_i||_2.    (18)

(A minimal sketch of this computation appears after this list.)
(d) (5 points) Describe what relationship, if any, there is between Equations 18 and 6. Note that the points we've given you are well-described by a linear projection – there's no noise in the measurements – but in practice, there will be an error that has to be minimized. Both equations represent objectives that could be used. If they are the same, show it; if they are not the same, report which one makes more sense to minimize. Things to consider include whether the equations directly represent anything meaningful.
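A minimal numpy sketch of the error in Equation (18) (the helper name is illustrative):

import numpy as np

def mean_reprojection_error(M, pts3d, pts2d):
    # pts3d: Nx3 world points, pts2d: Nx2 pixel locations.
    X_h = np.hstack([pts3d, np.ones((pts3d.shape[0], 1))])   # Nx4 homogeneous points
    proj = (M @ X_h.T).T                                      # Nx3 homogeneous pixels
    proj = proj[:, :2] / proj[:, 2:3]                         # proj(M, X_i)
    return np.mean(np.linalg.norm(proj - pts2d, axis=1))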
3 Estimation of the Fundamental Matrix and Reconstruction
Data: we give you a series of datasets that are nicely bundled in the folder task23/. Each dataset contains two images img1.png and img2.png and a numpy file data.npz containing a whole bunch of
variables. The script task23.py shows how to load the data.
Credit: temple comes from Middlebury's Multiview Stereo dataset. The objects shown in the synthetic images are described in HW0's credits.
Task 2: Estimating F [50 points]
(a) (20 points) Fill in find_fundamental_matrix in task23.py. You should implement the eight-point algorithm. Remember to normalize the data (either the fancier approach works, or scaling by the image size and centering at 0 is fine for our purposes) and to reduce the rank of F. You can compare with cv2.findFundamentalMat(pts1, pts2, cv2.FM_8POINT), which should roughly match.
(b) (5 points) Report F (normalized so F_{2,2} = 1) for temple, ztrans, and xtrans.
(c) (10 points) Fill in compute_epipoles. This should return the homogeneous coordinates of the epipoles – remember they can be infinitely far away!
(d) (5 points) Show epipolar lines for temple, reallyInwards, and another dataset of your choice.
(e) (5 points) Report the epipoles for reallyInwards and xtrans.
Task 3: Triangulating X [30 points]
Figure 2: Visualizations of reconstructions (ztrans, reallyInwards).
The next step is extracting 3D points from 2D points and camera matrices, which is called triangulation. Let
X be a point in 3D.
p = M_1 X    p′ = M_2 X    (19)
Triangulation solves for X given p, p′, M_1, M_2. We'll use OpenCV's algorithms to do this. You can feel
free to use OpenCV’s Fundamental Matrix code if you’d like for this part.
1. (5 points) Compute and report the Essential Matrix E given the Fundamental Matrix F. You should do this for the dataset reallyInwards. Recall that

F = K′^{-T} E K^{-1},    (20)

and that K, K′ are always invertible (for reasonable cameras), so you can compute E straightforwardly.
2. (15 points) Fill in find_triangulation in task23.py.
The first camera's projection matrix is K[I, 0]. The second camera's projection matrix can be obtained by decomposing E into a rotation and translation via cv2.decomposeEssentialMat. This function returns two rotation matrices R_1 and R_2 and a translation t. The four possible camera matrices for M_2 are:

M_2^1 = K′[R_1, t],  M_2^2 = K′[R_1, −t],  M_2^3 = K′[R_2, t],  M_2^4 = K′[R_2, −t].    (21)

You can identify which projection is correct by picking the one for which the most 3D points are in front of both cameras. This can be done by checking for positive depth, which amounts to looking at the last entry of the homogeneous coordinate: the extrinsics put the 3D point in the camera's frame, where z < 0 is behind the camera, and the last row of K is [0, 0, 1], so this does not change things.
Finally, triangulate the 2D points using cv2.triangulatePoints. (A rough sketch of this selection-and-triangulation step appears after this list.)
3. (10 points) Put a visualization of the point cloud for reallyInwards in your report. You can use visualize_pcd in utils.py or implement your own.
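For reference, a rough sketch of the selection-and-triangulation step described in part 2 (the name and interface are illustrative, not the required find_triangulation signature):

import cv2
import numpy as np

def pick_M2_and_triangulate(E, K1, K2, pts1, pts2):
    # pts1, pts2: Nx2 float arrays of corresponding pixel locations.
    M1 = K1 @ np.hstack([np.eye(3), np.zeros((3, 1))])        # K[I, 0]
    R1, R2, t = cv2.decomposeEssentialMat(E)
    best_X, best_count = None, -1
    for R, tt in [(R1, t), (R1, -t), (R2, t), (R2, -t)]:
        M2 = K2 @ np.hstack([R, tt.reshape(3, 1)])
        X_h = cv2.triangulatePoints(M1, M2, pts1.T, pts2.T)    # 4xN homogeneous points
        X = (X_h[:3] / X_h[3]).T                               # Nx3 3D points
        X_hom = np.vstack([X.T, np.ones(len(X))])
        d1 = (M1 @ X_hom)[2]            # last entry of the projection = depth test
        d2 = (M2 @ X_hom)[2]
        count = np.sum((d1 > 0) & (d2 > 0))
        if count > best_count:          # keep the choice with the most points in front
            best_count, best_X = count, X
    return best_X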
References and Credits
• Temple dataset used in Tasks 2 and 3: http://vision.middlebury.edu/mview/data/.
• Parts of the homework are taken from Georgia Tech CS 6476 by James Hays and CMU 16-385. Please feel free to similarly re-use our problems while similarly crediting us.