Apple Researchers Present ByteFormer, An AI Model that Consumes Only Bytes and Doesn’t Explicitly Model the Input Modality
Deep learning inference is typically based on explicit modeling of the input modality. Vision Transformers (ViTs), for example, directly model the 2D spatial organisation of images by encoding image patches into vectors. Audio inference often relies on computing spectral features (such as MFCCs), which are then passed into a network. As shown in Figure 1, to run inference on a saved file (such as a JPEG image or an audio file), a user must first decode it into a modality-specific representation (such as an RGB tensor or MFCCs). Decoding inputs into a modality-specific representation has two major downsides.
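As a rough illustration of this conventional pipeline, the sketch below decodes a JPEG into an RGB tensor and an audio clip into MFCCs before any model sees the data. The file names and the commented-out model calls are placeholders, not taken from the paper.

```python
# Hypothetical sketch of the conventional, modality-specific pipeline:
# each saved file is decoded into a representation tied to its modality
# before inference. Paths and model calls are illustrative placeholders.
import numpy as np
from PIL import Image
import librosa

# Image: decode the JPEG into an RGB tensor before the vision model sees it.
rgb = np.asarray(Image.open("photo.jpg").convert("RGB"), dtype=np.float32) / 255.0

# Audio: decode the waveform and compute MFCC features for the audio model.
waveform, sr = librosa.load("clip.wav", sr=None)
mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=20)

# logits = vision_model(rgb)   # placeholder
# logits = audio_model(mfcc)   # placeholder
```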
First, an input representation must be hand-crafted for each input modality. Transformer backbones have been used in recent projects such as PerceiverIO and UnifiedIO, but these techniques still require modality-specific preprocessing. PerceiverIO, for example, decodes image files before passing them to the network and transforms other input modalities into different forms. The authors postulate that by running inference directly on file bytes, modality-specific preprocessing can be eliminated entirely. The second disadvantage of decoding inputs into a modality-specific representation is that it exposes the content being analysed.
Imagine a smart home device that relies on RGB images to make inferences. If an adversary gains access to the model's input, the user's privacy could be compromised. The authors argue that inference can instead be performed on privacy-preserving inputs. To address both problems, they note that many input modalities can be stored as file bytes. At inference time, they feed these file bytes directly into their model (Figure 1b) without decoding, using a modified Transformer architecture that accommodates a variety of inputs and modalities.
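To make the byte-level idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: raw file bytes are embedded as tokens (one embedding per possible byte value) and fed to a plain Transformer encoder, with no modality-specific decoding. The class name, hyperparameters, and pooling choice are all illustrative assumptions.

```python
# Minimal sketch of inference on raw file bytes (illustrative, not ByteFormer itself).
import torch
import torch.nn as nn

class ByteClassifier(nn.Module):
    def __init__(self, num_classes: int, d_model: int = 256,
                 n_layers: int = 4, n_heads: int = 4, max_len: int = 4096):
        super().__init__()
        # One embedding per possible byte value (0-255), plus learned positions.
        self.byte_embed = nn.Embedding(256, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (batch, seq_len) integer byte values read from the saved file.
        pos = torch.arange(byte_ids.size(1), device=byte_ids.device)
        x = self.byte_embed(byte_ids) + self.pos_embed(pos)
        x = self.encoder(x)
        return self.head(x.mean(dim=1))  # mean-pool over the byte sequence

# Inference directly on the stored file, with no decoding step; the same code
# path works for an image file or an audio file.
with open("photo.jpg", "rb") as f:
    data = list(f.read())[:4096]          # truncate to the context length
byte_ids = torch.tensor(data, dtype=torch.long).unsqueeze(0)
model = ByteClassifier(num_classes=1000)
logits = model(byte_ids)
```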