Note: This post is part 3 in a series. You can find the inaugural post here.
The last post gave an overview of quantum kernels – one particular quantum-enhanced machine learning algorithm – and discussed some of their promises and limitations. In this post, I expand on the latter. This is not intended to dampen interest in the topic, but rather to give a balanced view of the matter. Indeed, in the last post we saw two limitations already:
Purely-classical kernels may be just as good as quantum ones.
The quantum compute runtime for calculating kernels might get very large, very quickly, for large data sets.
This post focuses on three additional limitations:
Various factors can cause kernel values to ‘collapse’ and become extremely hard to distinguish.
Striving to tailor a kernel to a given data set may render it easy to approximate classically.
Transferring large data sets to/from a quantum computer may slow down the overall runtime of the workflow.
The second limitation here can be considered an extension of, or deeper dive into, the first limitation from the prior post.
Let’s dive into each below.
Ensuring sufficiently-accurate estimation of kernel values may require burdensome resources.
Recall that to calculate a quantum kernel, you need to run a quantum circuit. In the limit of both an ideal quantum computer and an infinite number of repetitions of the circuit, you’d be able to calculate the kernel value exactly. In practice, quantum computers are noisy, and you can only run the circuit a finite number of times – hence, what you end up with is a noisy estimate of the kernel value. Depending on how much noise there is in these estimates, there might be poor downstream consequences for your model.
In particular, it might not be possible to resolve, with sufficient accuracy, two distinct kernel values for two different pairs of data points. That is, if K1 and K2 are the actual (ideal) kernel values, you might end up with estimates K1’ and K2’ that are closer to one another than K1 and K2 are. Put differently, in order to resolve K1’ and K2’ to sufficient accuracy (say, to within the difference between K1 and K2), you may need a large number of repetitions of the circuit.
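To make the shot-noise point concrete, here's a minimal sketch in plain NumPy (no quantum SDK). It treats a fidelity-type kernel estimate as a binomial proportion; the values 0.52 and 0.50 are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

def estimate_kernel(k_true, shots, rng):
    """Simulate a shot-based estimate of a kernel value.

    A fidelity-type quantum kernel is estimated by repeating a circuit
    `shots` times and counting how often a particular outcome (e.g. the
    all-zeros bitstring) appears, so the estimate is a binomial
    proportion with mean `k_true` and standard error
    sqrt(k_true * (1 - k_true) / shots).
    """
    hits = rng.binomial(shots, k_true)
    return hits / shots

# Two nearby ideal kernel values, gap 0.02 (illustrative numbers)
k1, k2 = 0.52, 0.50
for shots in (100, 10_000, 1_000_000):
    e1 = estimate_kernel(k1, shots, rng)
    e2 = estimate_kernel(k2, shots, rng)
    # Resolving a gap of 0.02 reliably needs on the order of
    # 1 / 0.02**2 = 2,500 shots per kernel value.
    print(shots, round(e1 - e2, 4))
```

The punchline is the usual 1/sqrt(shots) scaling: halving the gap you need to resolve quadruples the number of circuit repetitions.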
It turns out other factors may also inhibit your ability to resolve distinct kernel values. In Exponential concentration and untrainability in quantum kernel methods, it is shown that such factors include the presence of entanglement in the circuit, defining a kernel value through measuring all the qubits in the circuit, using highly expressive circuits, and noise. All of these can cause kernel values to concentrate exponentially. (See Figure 1.) This means you’d need an exponential number of repetitions of the circuit to resolve distinct kernel values.
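A quick way to see concentration numerically: for Haar-random states – a stand-in for what a highly expressive encoding circuit produces – the average fidelity between states for two different inputs falls off like 1/2^n. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_overlap(n_qubits, n_pairs, rng):
    """Average fidelity |<psi|phi>|^2 between pairs of Haar-random states.

    Haar-random pure states on 2**n_qubits dimensions are sampled by
    drawing complex Gaussian vectors and normalizing them.
    """
    d = 2 ** n_qubits
    vals = []
    for _ in range(n_pairs):
        psi = rng.normal(size=d) + 1j * rng.normal(size=d)
        phi = rng.normal(size=d) + 1j * rng.normal(size=d)
        psi /= np.linalg.norm(psi)
        phi /= np.linalg.norm(phi)
        vals.append(abs(np.vdot(psi, phi)) ** 2)
    return np.mean(vals)

# Mean fidelity ~ 1 / 2**n: the kernel values concentrate toward 0
for n in (2, 4, 6, 8):
    print(n, mean_overlap(n, 200, rng))
```

With the typical kernel value shrinking like 1/2^n and shot noise shrinking only like 1/sqrt(shots), distinguishing two such values requires a number of repetitions growing exponentially in the number of qubits.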
Getting around this problem is going to require more work to find families of circuits which avoid these ‘kernel collapse’ factors, while also yielding useful models.
Tailoring quantum kernels to data sets may render them classically-easy.
Here, “tailoring” can mean “judiciously choosing the encoding circuit” and/or “endowing the circuit with a tunable hyperparameter which you optimize over”. Doing too much tailoring, however, may result in a kernel function which is easy to approximate classically (in the sense of the paper The Power of Data in Quantum Machine Learning from last time). The trick, then, seems to be finding both data sets and tailoring methods for which the tailoring actually helps, but without tipping the quantum-enhanced model into a place where you could just as easily use a purely-classical one.
The key idea in all of this is that the encoding circuit defines the kernel function, and that kernel function can, at least formally, be written down as a sum of terms. (For example, you can use a Fourier series to decompose it.) It turns out that you can rigorously study how the structure of the encoding circuit dictates the terms in the decomposition (as a Fourier series); The effect of data encoding on the expressive power of variational quantum machine learning models explains how.
What’s interesting about that work is that it shows the parts of the circuit with tunable parameters (as opposed to the parameters used to encode the data) control the amplitudes of the terms in the decomposition. (The parts of the circuit which actually do the data encoding control the trigonometric functions which show up in the decomposition.) In turn, this implies the set of functions you can express using a given circuit depends quite strongly on those tunable parameters and how they enter the circuit.
A related idea is presented in the video Bandwidth Enables Generalization in Quantum Kernel Models, which discusses how introducing such hyperparameters in the kernel impacts the ability of the quantum-enhanced model to generalize well. The paper appears here.
One way to tune these parameters is through kernel training/alignment (an idea from classical ML).
Watch Lecture 7.1 - Quantum Kernels in Practice for a discussion of kernel training/alignment in the context of data sets with a special mathematical structure (namely, groups).
The paper: Covariant Quantum Kernels for Data With Group Structure
Supplemental Lecture Notes: https://learn.qiskit.org/content/summer-school/2021/resources/lecture-notes/Lecture7.1.pdf
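For reference, the kernel-target alignment objective from classical ML fits in a few lines of NumPy (the toy kernel matrices below are invented for illustration):

```python
import numpy as np

def kernel_alignment(K, y):
    """Kernel-target alignment A(K, yy^T) = <K, yy^T>_F / (||K||_F ||yy^T||_F).

    Values near 1 mean the kernel's notion of similarity matches the
    label structure; kernel training tunes circuit hyperparameters to
    maximize this quantity on the training set.
    """
    Y = np.outer(y, y)
    return float(np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y)))

# Toy check with +/-1 labels
y = np.array([1, 1, -1, -1])
K_good = np.outer(y, y).astype(float)  # kernel mirroring the labels
K_flat = np.ones((4, 4))               # uninformative kernel
print(kernel_alignment(K_good, y))  # 1.0
print(kernel_alignment(K_flat, y))  # 0.0
```

In the quantum setting, `K` would be the matrix of estimated kernel values for a given set of circuit hyperparameters, and the alignment would be maximized over those hyperparameters.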
Although these ideas of tailoring kernels to data seem promising, there are a couple of downsides:
It’s not obvious how to find problems where you can use knowledge of the kind of data being processed to design a good quantum circuit. See, for example, The Inductive Bias of Quantum Kernels, which concludes its abstract with
“...we conjecture that quantum machine learning models can offer speed-ups only if we manage to encode knowledge about the problem at hand into quantum circuits, while encoding the same bias into a classical model would be hard. These situations may plausibly occur when learning on data generated by a quantum process, however, they appear to be harder to come by for classical datasets.”
Numerical work on real-world data sets suggests that, as measured by the geometric difference quantity introduced in the last post, increasing the number of qubits makes the kernel easier and easier to approximate classically. What’s more, scaling the number of data points used doesn’t seem to help. See Numerical evidence against advantage with quantum fidelity kernels on classical data.
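For readers who want to experiment with this quantity themselves, here's a sketch of the geometric difference computation, assuming the definition g(Kc || Kq) = sqrt(||sqrt(Kq) Kc^{-1} sqrt(Kq)||_inf) from The Power of Data in Quantum Machine Learning; the regularizer and toy matrix are my own choices:

```python
import numpy as np

def geometric_difference(Kc, Kq, reg=1e-8):
    """Geometric difference g(Kc || Kq) between two kernel matrices.

    Kc plays the role of the classical kernel, Kq the quantum one. A
    small regularizer is added to Kc before inversion for numerical
    stability; the outer norm is the spectral (largest-eigenvalue) norm.
    """
    w, V = np.linalg.eigh(Kq)
    sq = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T  # PSD square root
    M = sq @ np.linalg.inv(Kc + reg * np.eye(len(Kc))) @ sq
    return float(np.sqrt(np.linalg.norm(M, 2)))

# Sanity check: identical kernels give g close to 1, i.e. no separation
K = np.array([[1.0, 0.5], [0.5, 1.0]])
print(round(geometric_difference(K, K), 3))  # ~1.0
```

A small g says a classical model with kernel Kc can match anything the quantum kernel Kq does, which is exactly the regime the cited numerical work finds as the qubit count grows.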
The lines of work presented here suggest to me that more work needs to be done to (a) put learning using quantum kernels on a firmer foundation from a theory of machine learning standpoint (i.e., we need more proofs about what function(s) can and cannot be advantageously learned using quantum kernels), and (b) study both real-world data sets and a broader class of encoding circuits to help the community get a handle, empirically, on the strengths and limitations of quantum kernels for non-ad-hoc data.
Data transfer costs may slow down the overall workflow.
In the last post on quantum kernels, we discussed how the amount of quantum compute runtime needed to estimate all the kernel values for a given data set could become prohibitively large as the data set size increased. There, the issue is that the speed with which kernel values can be calculated is a bottleneck.
Hence, finding ways to somehow minimize the amount of data which needs to actually be processed using a quantum computer would be helpful. This would also be helpful for another reason: access to quantum computers is commonly provided via the cloud. So if you have a large data set you want to process, you need to ship it. Although the pipes of the internet are big, you might still run into latency issues trying to move data. So again, finding ways to minimize the amount of data transferred would be ideal.
At the risk of seeming self-serving, I did a project looking at a way to deal with this. The idea is to use a quantum computer to explicitly estimate some of the kernel values, and then use (classical) matrix completion to infer the rest. This approach had the advantage that we could bound the minimal number of data points which needed to be sent to the quantum computer. But, it turned out that number was basically “ship all the data, all the time”. That said, numerical experiments on real-world data suggested we might be OK shipping an amount of data which was less than the minimum, at the expense of some error.
Watch Kernel Matrix Completion for Offline Quantum-Enhanced Machine Learning, presented by the lead author, Annie Naveh, as part of Quantum Techniques in Machine Learning 2021.
I gave a talk at the Quantum Artificial Intelligence (QAI) workshop, co-located with IEEE Quantum Week 2021 (QCE21). Slides are here.
The paper: Kernel Matrix Completion for Offline Quantum-Enhanced Machine Learning
For what it’s worth, I would be really interested if someone coded up this method and did a sandbox-like simulation to study how the overall runtime of this approach scales with data set size, the amount of data transferred, various assumptions on delays/latencies, and the completion algorithm.
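As a starting point for such a sandbox, here's a toy sketch of the completion step. Note this uses a simple Nyström approximation as a stand-in, not the completion algorithm from the paper; the idea of measuring only a subset of kernel entries and inferring the rest is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

def nystrom_complete(K_full, landmarks):
    """Reconstruct a PSD kernel matrix from a subset of its entries.

    Only the rows/columns indexed by `landmarks` are assumed to have
    been measured on the quantum computer; the rest of the matrix is
    inferred via the Nystrom low-rank approximation C W^+ C^T.
    """
    C = K_full[:, landmarks]                  # measured columns
    W = K_full[np.ix_(landmarks, landmarks)]  # measured landmark block
    return C @ np.linalg.pinv(W) @ C.T

# Toy low-rank kernel: Gram matrix of a few random feature vectors
X = rng.normal(size=(40, 3))
K = X @ X.T                                   # rank-3 PSD matrix
K_hat = nystrom_complete(K, list(range(5)))
print(np.max(np.abs(K - K_hat)))  # small: rank 3 recovered from 5 columns
```

Wrapping a loop around this – varying the data set size, the fraction of measured entries, and a simulated per-entry latency – would give a crude first cut at the runtime-scaling study described above.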
Thus concludes our foray into the topic of quantum kernels. In the next installment in this series, we’ll take a look at quantum neural networks.
By the way, how are you enjoying this series? Please leave a comment and let me know what’s going well, and what can be improved!