We consider the problem of performing linear regression over a stream of $d$-dimensional examples, and show that any algorithm that uses a subquadratic amount of memory exhibits a slower rate of convergence than can be achieved without memory constraints. Specifically, consider a sequence of labeled examples $(a_1,b_1), (a_2,b_2)ldots,$ with $a_i$ drawn independently from a $d$-dimensional isotropic Gaussian, and where $b_i = langle a_i, xrangle + eta_i,$ for a fixed $x in mathbb{R}^d$ with $|x|_2 = 1$ and with independent noise $eta_i$ drawn uniformly from the interval $[-2^{-d/5},2^{-d/5}].$ We show that any algorithm with at most $d^2/4$ bits of memory requires at least $Omega(d log log frac{1}{epsilon})$ samples to approximate $x$ to $ell_2$ error $epsilon$ with probability of success at least $2/3$, for $epsilon$ sufficiently small as a function of $d$. In contrast, for such $epsilon$, $x$ can be recovered to error $epsilon$ with probability $1-o(1)$ with memory $Oleft(d^2 log(1/epsilon)right)$ using $d$ examples. This represents the first nontrivial lower bounds for regression with super-linear memory, and may open the door for strong memory/sample tradeoffs for continuous optimization.