Interview with Dr. Marco Pedersoli after his presentation in our DaSCI seminars

10 May, 2021

We had the opportunity to interview Dr. Marco Pedersoli after his presentation in our DaSCI seminars. In his talk, "Efficient Deep Learning", he introduced the most common families of approaches used to reduce the memory and computation requirements of DL methods, for both training and deployment, and showed how a reduction of the model footprint does not always produce a corresponding speed-up. He also presented recent results suggesting that large DL models are important mostly for facilitating training: once training is finished, a much smaller and faster model can be deployed with almost no loss in accuracy.

Dr. Marco Pedersoli (http://profs.etsmtl.ca/mpedersoli/) is an Assistant Professor at ETS Montreal. He obtained his PhD in computer science in 2012 at the Autonomous University of Barcelona and the Computer Vision Center of Barcelona. He was then a postdoctoral fellow in computer vision and machine learning at KU Leuven with Prof. Tuytelaars, and later at INRIA Grenoble with Drs. Verbeek and Schmid. At ETS Montreal he is a member of LIVIA and co-chairs an industrial Chair on Embedded Neural Networks for Connected Building Control. His research is mostly applied to visual recognition: the automatic interpretation and understanding of images and videos. His specific focus is on reducing the complexity and the amount of annotation required by deep learning algorithms such as convolutional and recurrent neural networks. Prof. Pedersoli has authored more than 40 publications in top-tier international conferences and journals in computer vision and machine learning.

{DaSCI} The first question we wanted to ask is related to GPU speed and low-energy devices. Would it be right to say that GPUs are just the opposite of Green Computing? Could GPUs become "green" some day in the not-too-distant future?

{Marco Pedersoli} You're right: as far as I know, GPUs are not energy efficient at all. But at the same time, so far, they are the only available option for doing computation on lots of data. I know there are many different projects trying to solve this problem, that is, to be much more efficient than GPUs, and also to be able to work with sparse data, because one of the drawbacks of GPUs is that they are efficient only on dense data; sparse data is a big problem for them. So yes, so far we are at an initial stage where we use GPUs because they are the only thing available, and with very large models you can end up with a very high cost in terms of electricity, and then in terms of footprint. But now many companies are trying to optimize GPUs for deep learning, and this optimization is not only in terms of computation but also in terms of reducing the ecological footprint.
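To make the dense-versus-sparse point concrete, here is a minimal PyTorch sketch (the matrix size and sparsity level are arbitrary choices of ours, not from the talk): the same multiplication is run through a dense kernel and a sparse one, and on a GPU the dense version is typically the faster of the two even though it touches many zeros.

```python
import torch

n = 4096
density = 0.1  # 90% of the entries are zero

# A random matrix with ~90% zeros, stored densely.
mask = (torch.rand(n, n) < density).float()
dense = mask * torch.randn(n, n)
x = torch.randn(n, n)

if torch.cuda.is_available():          # fall back to CPU otherwise
    dense, x = dense.cuda(), x.cuda()

sparse = dense.to_sparse()             # same values, COO format

y_dense = dense @ x                    # dense kernel: regular access, GPU-friendly
y_sparse = torch.sparse.mm(sparse, x)  # sparse kernel: skips zeros,
                                       # but with irregular memory access

print(torch.allclose(y_dense, y_sparse, atol=1e-2))  # same result either way
```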

{DaSCI} Do you think that small research groups or groups with low budgets would have a hard time working with GPUs, that is, in comparison with enterprises or bigger research groups that have more hardware resources?

{MP} Yes, for sure. Nowadays, to be able to do good research we need, as always, good ideas, but we also need a lot of resources. That is one of the reasons why I chose to do research in efficient computing: with efficient methods, we can work with a relatively low budget and few resources, thus enabling small companies or small research centers to do deep learning and train their own models. But at the same time, the more resources we have, the better. For that reason, I think it's really important for a small center or group to associate with other groups or centers in order to scale up, because if everyone buys their own GPUs, it doesn't really scale. If you build a cluster of GPUs that can be shared, the same resources are used more efficiently: sometimes we have deadlines and have to use a lot of these resources; at other times, other people have a deadline and they use the same resources. So it's important to scale up and pool these resources to be able to compete with the big companies and their large resources.
In Canada, for instance, it's quite nice because there is what is called Compute Canada (https://www.computecanada.ca), a big cluster of computers that every professor in Canada can access. There are two ways to access it. One option is the common access available to everyone, where priority is based on how much you have already been using it and how many people are currently using it. The other option is even more interesting: you can apply for resources, and if you win the grant, you will have specific resources reserved for you and your projects.

{DaSCI} GPUs have evolved over time; they have more and more computing power each year. Is it easier to take already-trained models and later prune and retrain them, or to just buy another big GPU and plug it in with the others? Which one do you think is the future in this field?

{MP} That's the thing about science: it depends a lot on what you actually want to do. If the aim is to use a model that has already been trained on other data and maybe adapt it to your specific domain, approaches like pruning or distillation can be good ideas. But if you really want to evaluate your models on a large data set, then there are also other techniques that can reduce the computational cost of your model through better architectures, and training can be shorter even if you have to train with a lot of data. So I would say there is no perfect solution; all of them have pros and cons.
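Distillation, one of the approaches mentioned here, can be sketched in a few lines. Below is a minimal, hypothetical PyTorch example (the tiny teacher/student architectures, the temperature and the loss weighting are our own placeholder choices): a small student is trained to match the softened outputs of a frozen, larger teacher.

```python
import torch
import torch.nn.functional as F

# Placeholder models: a large "teacher" (assumed pretrained)
# and a much smaller "student" we actually want to deploy.
teacher = torch.nn.Sequential(torch.nn.Linear(784, 1024),
                              torch.nn.ReLU(),
                              torch.nn.Linear(1024, 10))
student = torch.nn.Sequential(torch.nn.Linear(784, 32),
                              torch.nn.ReLU(),
                              torch.nn.Linear(32, 10))

T = 4.0       # temperature: softens the teacher's distribution
alpha = 0.5   # balance between distillation and hard-label loss
opt = torch.optim.SGD(student.parameters(), lr=0.1)

def distill_step(x, labels):
    with torch.no_grad():           # the teacher stays frozen
        t_logits = teacher(x)
    s_logits = student(x)
    # KL divergence between softened distributions; the T*T factor
    # keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                    F.softmax(t_logits / T, dim=1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(s_logits, labels)
    loss = alpha * soft + (1 - alpha) * hard
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# e.g. distill_step(torch.randn(64, 784), torch.randint(0, 10, (64,)))
```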

{DaSCI} Normally we would say that if your problem has a dense representation you can work with GPUs, but if you have a sparse representation then GPUs are not useful. Nowadays, however, there is a research trend towards implementing deep neural networks at the hardware level. Could CPUs then be as efficient as GPUs?

{MP} Well, if you work at the hardware level with CPUs that can do XORs directly at a low level (using crossbars, for instance), you can obtain very good speed-ups. Then a network that would normally take, let's say, one second on a GPU may take the same time on a CPU with these approximations, which is great. But training at the hardware level gets a bit more tricky: it's possible, but you need to take care of some additional things. Normally, training still uses floating point instead of a binary representation. So when it's about inference, you can be very fast on CPUs, but when it's about training, it's still difficult. Nevertheless, I believe it is an interesting research topic to be able to train good binary nets without using floating point, which would make it possible to train deep learning models directly on CPUs. The performance is normally a bit lower, but it's still a great advantage.
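The XOR trick referred to here is the core of binary (XNOR) networks: once weights and activations are constrained to ±1 and packed into machine words, a dot product collapses to an XOR plus a popcount. A toy NumPy sketch of the arithmetic (the vector length and the packing scheme are our illustration, not a real hardware kernel):

```python
import numpy as np

n = 64  # one 64-bit word's worth of values

# Binarize random weights and inputs to +-1.
w = np.where(np.random.randn(n) >= 0, 1, -1).astype(np.int8)
x = np.where(np.random.randn(n) >= 0, 1, -1).astype(np.int8)

def pack(v):
    """Pack a +-1 vector into a single 64-bit word, 1 bit per value."""
    word = np.uint64(0)
    for i, positive in enumerate(v > 0):
        if positive:
            word |= np.uint64(1) << np.uint64(i)
    return word

pw, px = pack(w), pack(x)

# For +-1 vectors: dot(w, x) = n - 2 * popcount(w XOR x),
# since each mismatched bit contributes -1 and each match +1.
mismatches = bin(int(pw ^ px)).count("1")
fast_dot = n - 2 * mismatches

print(fast_dot == int(w.astype(np.int32) @ x.astype(np.int32)))  # True
```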

{DaSCI} Binary nets are very useful for computer vision, but perhaps not applicable to other things, like recurrent networks. How specific to a particular field, then, are these kinds of architectures?

{MP} It depends a lot on the technique, but in general they are quite general. That's basically the idea of machine learning, right? You don't want to care too much about the specific conditions of the problem; you want to solve a general problem that can be applied to different domains. The techniques I presented in the talk were used for computer vision, since I do research mostly in this area. They make use of convolutional neural networks, and I would say that most of the techniques I explained work very well for convolutional neural networks, but they can also work with any other type of neural network. And yes, they would not be highly affected if the specific domain changed. Of course, it depends on the kind of data you have: maybe if the input data is very sparse, one technique can be better than another, but in general, all of them should work.

{DaSCI} When deep learning started to become fashionable, everybody was talking about it. But artificial neural networks had their moment years ago, and they are now somehow experiencing a resurgence. Should we expect deep learning to be with us for a long time? Or will it be just a fashion, replaced by something else? What are your thoughts on this topic?

{MP} From my perspective, I think you're right that even in research we have these kinds of fashion trends. Some topics become very fashionable while others do not, and we should find ways to avoid that, because we need to be able to follow completely different directions if we really want to find new and interesting ideas. It's important to make sure that not only the fashionable topics receive money, but other lines of research as well. That said, I think deep learning is going to stay, because we have seen that it's not just about fashion, in the sense that it works well. It requires, of course, lots of data and lots of computation, but there is also a lot of research trying to reduce the amount of data that we really need, and the amount of computation too. So in my opinion it's going to stay, and it's going to evolve: if we look at what deep learning was 10 years ago, it's very different from what it is now. So I think it's going to evolve and stay in fashion for a long time.

{DaSCI} You mentioned in the talk that pruning makes networks more efficient, and also that with typical architectures (e.g., VGG, AlexNet) you can delete almost 90% of all the weights while keeping the same accuracy. We know that doing this improves efficiency and reduces computation, but having fewer connections could also have another advantage: interpretability. Have you evaluated that?

{MP} What you're saying makes sense, but it will always depend on the order of magnitude we are talking about. For instance, if we have on the order of 10 weights or 10 neurons, we can probably check them manually; if it's on the order of thousands or millions, it will be impossible. But I hadn't thought about that; it could be exploited in some way to understand a bit more of what is happening inside the neural network.
If you have a good way of pruning, it means that you have a probability for the importance of each feature, and with that you could even select the few that have the highest probability or highest magnitude and then check them: what do they represent?
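As a rough illustration of what is described here, a minimal sketch of our own (the layer and the value of k are placeholders): take weight magnitude as the importance score, pick the top-k connections, and list them for manual inspection.

```python
import torch

layer = torch.nn.Linear(128, 10)  # stands in for a trained layer
k = 5                             # how many connections to inspect

scores = layer.weight.detach().abs()           # magnitude as importance
top = torch.topk(scores.flatten(), k).indices  # the k strongest connections

for flat in top:
    # Recover (output unit, input feature) from the flat index.
    out_unit, in_feat = divmod(flat.item(), scores.shape[1])
    w = layer.weight[out_unit, in_feat].item()
    print(f"output {out_unit} <- input {in_feat}: weight {w:+.3f}")

# Pruning would keep only these connections and zero out the rest:
mask = torch.zeros(scores.numel())
mask[top] = 1.0
pruned_weight = layer.weight.detach() * mask.view_as(scores)
```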

{DaSCI} Nowadays, people are discussing ethics and biases in AI. To address them, we may need to incorporate small algorithms or filters into our code to check that it behaves ethically. Could this "overhead" slow down the development of new algorithms? Do you believe researchers will adapt their work to gain transparency?

{MP} I would say it's a bit like driving, right? We always want to drive as fast as possible, but at the same time it gets dangerous, so it's good to have rules and speed limits that tell us how fast we can go. In the same way, I think it's important to be ethical and avoid biases, so even if it makes our methods a bit slower, it should be done anyway. Because, as I said, when we drive, it's more important not to have an accident and to reach the final destination safely.

{DaSCI} Power consumption and carbon footprint are also beginning to be reported in the deep learning literature. Do you think these will be a key feature of future research papers?

{MP} Yes, I think it's important for many reasons. First, to at least give an idea and an understanding of what has been done. Also, as you said, in terms of carbon footprint, because big corporations especially can reach levels of power consumption that do not make any sense. For instance, optimizing a model to perform inference 1.5 times faster than before may cost millions of dollars in trying all the possible configurations that lead to this improvement. So it's important to see the full picture and not only the final result. We need to see how they got there and how much computation they had to do, because otherwise it's unfair, not only in terms of ethics, but also when comparing the work of small labs with that of big labs.
It's a bit like when you produce goods: it's not just about the cost of producing them; you should also consider the cost of disposing of them. Sometimes companies don't care much about the materials they use for certain products because the disposal of those products is not part of their cost; it becomes a cost for the community. So they optimize, and even if they use materials that pollute more, they don't care much, because they care only about their profit and not about the cost to the community of eliminating the pollutants. Maybe the analogy is a bit far-fetched, but you get the idea.
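For readers who want to report this in practice, packages such as codecarbon (https://codecarbon.io) estimate the energy use and CO2 emissions of a training run. A minimal sketch, assuming the package is installed and using a throwaway stand-in for a real training loop:

```python
# pip install codecarbon
from codecarbon import EmissionsTracker
import torch

tracker = EmissionsTracker(project_name="my-experiment")
tracker.start()

# Stand-in for a real training loop.
model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(100):
    x = torch.randn(32, 10)
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

emissions = tracker.stop()  # estimated kg of CO2-equivalent
print(f"Estimated emissions: {emissions:.6f} kg CO2eq")
```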

We would like to thank Dr. Pedersoli for the extra time he spent with us after the seminar, and we hope his insights on the topics we discussed are as interesting to readers as they were to us.