We revisit and extend model stitching (Lenc & Vedaldi 2015) as a methodology to study the internal representations of neural networks. Given two trained and frozen models $A$ and $B$, we consider a stitched model formed by connecting the bottom layers of $A$ to the top layers of $B$, with a simple trainable layer between them. We argue that model stitching is a powerful and perhaps under-appreciated tool, which reveals aspects of representations that measures such as centered kernel alignment (CKA) cannot. Through extensive experiments, we use model stitching to obtain quantitative verification of intuitive statements such as "good networks learn similar representations", by demonstrating that good networks of the same architecture, but trained in very different ways (e.g., supervised vs. self-supervised learning), can be stitched to each other without a drop in performance. We also give evidence for the intuition that "more is better" by showing that representations learnt with (1) more data, (2) bigger width, or (3) more training time can be plugged into weaker models to improve performance. Finally, our experiments reveal a new structural property of SGD which we call "stitching connectivity", akin to mode connectivity: typical minima reached by SGD can all be stitched to each other with minimal change in accuracy.
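As a concrete illustration of the stitched-model construction (not the paper's exact code), here is a minimal PyTorch sketch; the module names, the split point, and the choice of a 1x1 convolution as the trainable stitching layer are assumptions made for this sketch.

```python
# Minimal model-stitching sketch, assuming two frozen CNNs whose feature maps at the
# split point have the same shape.  `bottom_of_a` / `top_of_b` are hypothetical
# sub-modules of models A and B; only the stitching layer is trained.
import torch
import torch.nn as nn

class StitchedModel(nn.Module):
    def __init__(self, bottom_of_a: nn.Module, top_of_b: nn.Module, channels: int):
        super().__init__()
        self.bottom = bottom_of_a                     # frozen bottom layers of A
        self.top = top_of_b                           # frozen top layers of B
        self.stitch = nn.Conv2d(channels, channels, kernel_size=1)  # simple trainable layer
        for p in list(self.bottom.parameters()) + list(self.top.parameters()):
            p.requires_grad_(False)

    def forward(self, x):
        with torch.no_grad():
            h = self.bottom(x)                        # representation computed by A
        return self.top(self.stitch(h))               # mapped into B's expected representation

# Training then optimizes only `model.stitch.parameters()` with the usual task loss;
# the "stitching penalty" is the accuracy drop relative to running B end to end.
```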
Boaz Barak, Kunal Marwaha (2021)
We study the performance of local quantum algorithms such as the Quantum Approximate Optimization Algorithm (QAOA) for the maximum cut problem, and their relationship to that of classical algorithms. (1) We prove that every (quantum or classical) one-local algorithm achieves on $D$-regular graphs of girth $> 5$ a maximum cut of at most $1/2 + C/\sqrt{D}$ for $C = 1/\sqrt{2} \approx 0.7071$. This is the first such result showing that one-local algorithms achieve a value bounded away from the true optimum for random graphs, which is $1/2 + P_*/\sqrt{D} + o(1/\sqrt{D})$ for $P_* \approx 0.7632$. (2) We show that there is a classical $k$-local algorithm that achieves a value of $1/2 + C/\sqrt{D} - O(1/\sqrt{k})$ for $D$-regular graphs of girth $> 2k+1$, where $C = 2/\pi \approx 0.6366$. This is an algorithmic version of the existential bound of Lyons and is related to the algorithm of Aizenman, Lebowitz, and Ruelle (ALR) for the Sherrington-Kirkpatrick model. This bound is better than that achieved by the one-local and two-local versions of QAOA.
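For a rough sense of scale of the bounds above, here is a short Python sketch evaluating the leading-order terms for a few degrees $D$; the $O(1/\sqrt{k})$ correction in (2) is omitted since its constant is not specified here.

```python
# Leading-order cut values for D-regular graphs, using the constants quoted above.
import math

def one_local_upper_bound(D):
    # (1): upper bound 1/2 + C/sqrt(D) with C = 1/sqrt(2) for any one-local algorithm (girth > 5)
    return 0.5 + (1 / math.sqrt(2)) / math.sqrt(D)

def true_optimum_leading(D):
    # true optimum on random D-regular graphs: 1/2 + P*/sqrt(D) + o(1/sqrt(D)), with P* ≈ 0.7632
    return 0.5 + 0.7632 / math.sqrt(D)

def k_local_classical_leading(D):
    # (2): 1/2 + (2/pi)/sqrt(D), ignoring the -O(1/sqrt(k)) term
    return 0.5 + (2 / math.pi) / math.sqrt(D)

for D in (3, 10, 100):
    print(D, one_local_upper_bound(D), true_optimum_leading(D), k_local_classical_leading(D))
```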
We propose a notation for tensors with named axes, which relieves the author, reader, and future implementers from the burden of keeping track of the order of axes and the purpose of each. It also makes it easy to extend operations on low-order tensors to higher order ones (e.g., to extend an operation on images to minibatches of images, or extend the attention mechanism to multiple attention heads). After a brief overview of our notation, we illustrate it through several examples from modern machine learning, from building blocks like attention and convolution to full models like Transformers and LeNet. Finally, we give formal definitions and describe some extensions. Our proposals build on ideas from many previous papers and software libraries. We hope that this document will encourage more authors to use named tensors, resulting in clearer papers and less bug-prone implementations. The source code for this document can be found at https://github.com/namedtensor/notation/. We invite anyone to make comments on this proposal by submitting issues or pull requests on this repository.
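To convey the flavor of computing with named axes (this is only an informal numpy sketch, not the notation proposed in the document), attention can be written so that axes are referred to by names such as "batch" or "key", with a name-to-letter table doing the positional bookkeeping; all identifiers below are illustrative.

```python
# Toy "named axes" attention: operands are described by lists of axis names instead of positions.
import numpy as np

AXIS_LETTERS = {"batch": "b", "seq": "s", "seq2": "t", "key": "k", "val": "v"}

def named_einsum(in_axes, out_axes, *arrays):
    """einsum where each operand's axes are given as a list of names."""
    letters = lambda names: "".join(AXIS_LETTERS[n] for n in names)
    spec = ",".join(letters(a) for a in in_axes) + "->" + letters(out_axes)
    return np.einsum(spec, *arrays)

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # q: (batch, seq, key), k: (batch, seq2, key), v: (batch, seq2, val)
    scores = named_einsum([["batch", "seq", "key"], ["batch", "seq2", "key"]],
                          ["batch", "seq", "seq2"], q, k) / np.sqrt(q.shape[-1])
    return named_einsum([["batch", "seq", "seq2"], ["batch", "seq2", "val"]],
                        ["batch", "seq", "val"], softmax(scores, axis=-1), v)

q, k, v = np.random.randn(2, 5, 8), np.random.randn(2, 7, 8), np.random.randn(2, 7, 16)
print(attention(q, k, v).shape)   # (2, 5, 16)
```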
We prove a new upper bound on the generalization gap of classifiers that are obtained by first using self-supervision to learn a representation $r$ of the training data, and then fitting a simple (e.g., linear) classifier $g$ to the labels. Specifically, we show that (under the assumptions described below) the generalization gap of such classifiers tends to zero if $\mathsf{C}(g) \ll n$, where $\mathsf{C}(g)$ is an appropriately-defined measure of the simple classifier $g$'s complexity, and $n$ is the number of training samples. We stress that our bound is independent of the complexity of the representation $r$. We do not make any structural or conditional-independence assumptions on the representation-learning task, which can use the same training dataset that is later used for classification. Rather, we assume that the training procedure satisfies certain natural noise-robustness (adding a small amount of label noise causes a small degradation in performance) and rationality (getting the wrong label is not better than getting no label at all) conditions that widely hold across many standard architectures. We show that our bound is non-vacuous for many popular representation-learning-based classifiers on CIFAR-10 and ImageNet, including SimCLR, AMDIM and MoCo.
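The two-stage setup the bound applies to can be sketched as follows; `encoder` stands in for an arbitrary frozen self-supervised representation (e.g., a pretrained SimCLR or MoCo network), and the toy data at the end is only there to make the snippet runnable.

```python
# Sketch of "simple classifier on top of a frozen representation": only g's complexity
# (here, a linear/logistic classifier) enters the bound, not the encoder's.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_simple_classifier(encoder, x_train, y_train):
    r = encoder(x_train)                       # frozen representation r(x); arbitrarily complex
    g = LogisticRegression(max_iter=1000)      # the simple classifier g with C(g) << n
    g.fit(r, y_train)
    return g

def generalization_gap(encoder, g, x_tr, y_tr, x_te, y_te):
    return g.score(encoder(x_tr), y_tr) - g.score(encoder(x_te), y_te)

# Toy stand-ins (random-projection "encoder", random labels) just to exercise the code path.
rng = np.random.default_rng(0)
encoder = lambda x: np.tanh(x @ rng.normal(size=(20, 64)))
x_tr, y_tr = rng.normal(size=(500, 20)), rng.integers(0, 2, 500)
x_te, y_te = rng.normal(size=(200, 20)), rng.integers(0, 2, 200)
g = fit_simple_classifier(encoder, x_tr, y_tr)
print(generalization_gap(encoder, g, x_tr, y_tr, x_te, y_te))
```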
We give an algorithm for solving unique games (UG) instances whenever low-degree sum-of-squares proofs certify good bounds on the small-set expansion of the underlying constraint graph via a hypercontractive inequality. Our algorithm is in fact more versatile, and succeeds even when the constraint graph is not a small-set expander as long as the structure of non-expanding small sets is (informally speaking) characterized by a low-degree sum-of-squares proof. Our results are obtained by rounding \emph{low-entropy} solutions -- measured via a new global potential function -- to sum-of-squares (SoS) semidefinite programs. This technique adds to the (currently short) list of general tools for analyzing SoS relaxations for \emph{worst-case} optimization problems. As corollaries, we obtain the first polynomial-time algorithms for solving any UG instance where the constraint graph is either the \emph{noisy hypercube}, the \emph{short code} or the \emph{Johnson} graph. The prior best algorithm for such instances was the eigenvalue enumeration algorithm of Arora, Barak, and Steurer (2010), which requires quasi-polynomial time for the noisy hypercube and nearly-exponential time for the short code and Johnson graphs. All of our results achieve an approximation of $1-\epsilon$ vs. $\delta$ for UG instances, where $\epsilon>0$ and $\delta > 0$ depend on the expansion parameters of the graph but are independent of the alphabet size.
The linear cross-entropy benchmark (Linear XEB) has been used as a test for procedures simulating quantum circuits. Given a quantum circuit $C$ with $n$ inputs and outputs and a purported simulator whose output is distributed according to a distribution $p$ over $\{0,1\}^n$, the linear XEB fidelity of the simulator is $\mathcal{F}_{C}(p) = 2^n \mathbb{E}_{x \sim p} q_C(x) - 1$, where $q_C(x)$ is the probability that $x$ is output from the distribution $C|0^n\rangle$. A trivial simulator (e.g., the uniform distribution) satisfies $\mathcal{F}_C(p)=0$, while Google's noisy quantum simulation of a 53-qubit circuit $C$ achieved a fidelity value of $(2.24\pm0.21)\times10^{-3}$ (Arute et al., Nature 2019). In this work we give a classical randomized algorithm that for a given circuit $C$ of depth $d$ with Haar random 2-qubit gates achieves in expectation a fidelity value of $\Omega(\tfrac{n}{L} \cdot 15^{-d})$ in running time $\textsf{poly}(n,2^L)$. Here $L$ is the size of the \emph{light cone} of $C$: the maximum number of input bits that each output bit depends on. In particular, we obtain a polynomial-time algorithm that achieves large fidelity of $\omega(1)$ for depth $O(\sqrt{\log n})$ two-dimensional circuits. To our knowledge, this is the first such result for two-dimensional circuits of super-constant depth. Our results can be considered as evidence that fooling the linear XEB test might be easier than achieving a full simulation of the quantum circuit.
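The fidelity estimator itself is simple to state in code; the sketch below assumes access to the ideal output probabilities $q_C(x)$ (e.g., from a reference simulation of $C$) and uses the uniform simulator only as a sanity check.

```python
# Estimate the linear XEB fidelity F_C(p) = 2^n * E_{x~p}[q_C(x)] - 1 from samples of p.
import numpy as np

def linear_xeb_fidelity(samples, ideal_prob, n):
    """samples: n-bit strings drawn from the simulator; ideal_prob(x): probability of x under C|0^n>."""
    q = np.array([ideal_prob(x) for x in samples])
    return (2 ** n) * q.mean() - 1.0

# Sanity check: for the uniform (trivial) simulator against a flat ideal distribution,
# every q_C(x) equals 2^{-n}, so the estimated fidelity is exactly 0.
n = 4
samples = [tuple(np.random.randint(0, 2, n)) for _ in range(1000)]
print(linear_xeb_fidelity(samples, lambda x: 2.0 ** (-n), n))   # 0.0
```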
Type-two constructions abound in cryptography: adversaries for encryption and authentication schemes, if active, are modeled as algorithms having access to oracles, i.e., as second-order algorithms. But how about making cryptographic schemes themselves higher-order? This paper gives an answer to this question, by first describing why higher-order cryptography is interesting as an object of study, then showing how the concept of probabilistic polynomial-time algorithm can be generalized so as to encompass algorithms of order strictly higher than two, and finally proving some positive and negative results about the existence of higher-order cryptographic primitives, namely authentication schemes and pseudorandom functions.
We prove that every key agreement protocol in the random oracle model in which the honest users make at most $n$ queries to the oracle can be broken by an adversary who makes $O(n^2)$ queries to the oracle. This improves on the previous $\widetilde{\Omega}(n^6)$-query attack given by Impagliazzo and Rudich (STOC '89) and resolves an open question posed by them. Our bound is optimal up to a constant factor since Merkle proposed a key agreement protocol in 1974 that can be easily implemented with $n$ queries to a random oracle and cannot be broken by any adversary who asks $o(n^2)$ queries.
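Merkle's 1974 protocol mentioned above admits a very short sketch in the random-oracle model; the salted SHA-256 below merely stands in for the oracle, and the parameter choices are illustrative. Honest parties make on the order of $n$ oracle queries each, while an eavesdropper must effectively invert the oracle over a domain of size about $n^2$.

```python
# Sketch of Merkle's puzzle-based key agreement, with a hash standing in for the random oracle.
import hashlib, secrets

def oracle(x: int) -> str:
    return hashlib.sha256(b"random-oracle" + x.to_bytes(8, "big")).hexdigest()

def merkle_key_agreement(n: int):
    domain = n * n                                        # oracle domain of size ~n^2
    alice_secrets = [secrets.randbelow(domain) for _ in range(n)]
    published = {oracle(x): i for i, x in enumerate(alice_secrets)}   # Alice: n queries, sends hashes

    bob_queries = 0
    while True:                                           # Bob: ~n queries in expectation
        y = secrets.randbelow(domain)
        bob_queries += 1
        h = oracle(y)
        if h in published:
            i = published[h]                              # Bob announces only the index i
            return alice_secrets[i], y, bob_queries       # shared key: the common preimage

key_a, key_b, queries = merkle_key_agreement(50)
print(key_a == key_b, queries)
# An eavesdropper sees only the hashes and the index i, so recovering the key requires
# searching the oracle's domain of size ~n^2.
```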
We show that every construction of one-time signature schemes from a random oracle achieves black-box security at most $2^{(1+o(1))q}$, where $q$ is the total number of oracle queries asked by the key generation, signing, and verification algorithms. That is, any such scheme can be broken with probability close to $1$ by a (computationally unbounded) adversary making $2^{(1+o(1))q}$ queries to the oracle. This is tight up to a constant factor in the number of queries, since a simple modification of Lamport's one-time signatures (Lamport '79) achieves $2^{(0.812-o(1))q}$ black-box security using $q$ queries to the oracle. Our result extends (with a loss of a constant factor in the number of queries) also to the random permutation and ideal-cipher oracles. Since symmetric primitives (e.g., block ciphers, hash functions, and message authentication codes) can be constructed by a constant number of queries to the mentioned oracles, as a corollary we get lower bounds on the efficiency of signature schemes from symmetric primitives when the construction is black-box. This can be taken as evidence of an inherent efficiency gap between signature schemes and symmetric primitives.
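For reference, the basic Lamport scheme alluded to above (before the modification that improves the constant) is short enough to sketch directly; SHA-256 stands in for the random oracle and the parameters are illustrative.

```python
# Lamport one-time signatures from a hash oracle: the secret key holds two preimages per
# message-digest bit, and signing reveals exactly one of them per bit.
import hashlib, secrets

H = lambda x: hashlib.sha256(x).digest()
M = 256                                   # digest length in bits

def keygen():
    sk = [[secrets.token_bytes(32) for _ in range(2)] for _ in range(M)]
    pk = [[H(sk[i][b]) for b in range(2)] for i in range(M)]
    return sk, pk

def msg_bits(msg: bytes):
    d = int.from_bytes(H(msg), "big")
    return [(d >> i) & 1 for i in range(M)]

def sign(sk, msg):
    return [sk[i][b] for i, b in enumerate(msg_bits(msg))]   # reveal one preimage per bit

def verify(pk, msg, sig):
    return all(H(sig[i]) == pk[i][b] for i, b in enumerate(msg_bits(msg)))

sk, pk = keygen()
sig = sign(sk, b"hello")
print(verify(pk, b"hello", sig), verify(pk, b"attack", sig))  # True False (with overwhelming probability)
```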
We give a quasipolynomial time algorithm for the graph matching problem (also known as noisy or robust graph isomorphism) on correlated random graphs. Specifically, for every $\gamma>0$, we give an $n^{O(\log n)}$-time algorithm that, given a pair of $\gamma$-correlated $G(n,p)$ graphs $G_0,G_1$ with average degree between $n^{\varepsilon}$ and $n^{1/153}$ for $\varepsilon = o(1)$, recovers the ground truth permutation $\pi \in S_n$ that matches the vertices of $G_0$ to the vertices of $G_1$ in the way that minimizes the number of mismatched edges. We also give a recovery algorithm for a denser regime, and a polynomial-time algorithm for distinguishing between correlated and uncorrelated graphs. Prior work showed that recovery is information-theoretically possible in this model as long as the average degree is at least $\log n$, but sub-exponential time algorithms were only known in the dense case (i.e., for $p > n^{-o(1)}$). Moreover, Percolation Graph Matching, which is the most common heuristic for this problem, has been shown to require knowledge of $n^{\Omega(1)}$ seeds (i.e., input/output pairs of the permutation $\pi$) to succeed in this regime. In contrast, our algorithms require no seeds and succeed for $p$ as low as $n^{o(1)-1}$.
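A small numpy sketch of the setup follows: one common way to generate a correlated pair by subsampling a parent graph (the exact parametrization of "$\gamma$-correlated" may differ from the paper's), together with the mismatched-edge objective that the ground-truth permutation minimizes.

```python
# Correlated Erdos-Renyi pair via parent-graph subsampling, plus the mismatch objective.
import numpy as np

def correlated_pair(n, p_parent, s, rng):
    """Parent ~ G(n, p_parent); G0 and G1 each keep every parent edge independently with rate s."""
    upper = np.triu(rng.random((n, n)) < p_parent, k=1)
    parent = upper | upper.T
    def subsample():
        keep = np.triu(rng.random((n, n)) < s, k=1)
        return parent & (keep | keep.T)
    G0, G1_pre = subsample(), subsample()
    pi = rng.permutation(n)                  # ground truth: G0 vertex u corresponds to G1 vertex pi[u]
    G1 = np.zeros_like(G1_pre)
    G1[np.ix_(pi, pi)] = G1_pre              # relabel the second copy's vertices by pi
    return G0, G1, pi

def mismatched_edges(G0, G1, perm):
    """Vertex pairs that are an edge in exactly one graph after matching u -> perm[u]."""
    aligned = G1[np.ix_(perm, perm)]         # aligned[u, v] = G1[perm[u], perm[v]]
    return int(np.triu(G0 ^ aligned, k=1).sum())

rng = np.random.default_rng(0)
G0, G1, pi = correlated_pair(200, 0.05, 0.8, rng)
print(mismatched_edges(G0, G1, pi), mismatched_edges(G0, G1, rng.permutation(200)))
# the ground-truth permutation yields far fewer mismatches than a random one
```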