Akira’s Machine Learning News — Issue #35
Featured Papers/News This Week
- A method is proposed that masks an image and pre-trains a model to recover it, like BERT. 75% of the image is masked and only the unmasked 25% is fed to the encoder, which seems to be memory-friendly.
- An image generation model that combines a diffusion model with a masked language model has been presented. It seems to be able to adjust generation quality to match the computational resources at hand.
— — — — — — — — — — — — — — — — — — –
In the following sections, I will introduce various articles and papers not only on the above contents but also on the following four topics.
- Featured Papers/News This Week
- Machine Learning Use Case
- Papers
- Articles related to machine learning technology
— — — — — — — — — — — — — — — — — — –
1. Featured Papers/News This Week
Learning ViT by hiding the image with a mask and restoring it — arxiv.org
[2111.06377] Masked Autoencoders Are Scalable Vision Learners
They propose MAE (Masked Autoencoders), which hides most of an image (e.g., 75%) and learns to restore it through self-supervised learning. Using ViT-based models trained on ImageNet alone, it achieves 87.8% accuracy and outperforms existing self-supervised learning methods such as DINO and MoCo v3.
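The masking step at the heart of MAE can be sketched as follows. This is a minimal NumPy sketch under my own naming, not the paper's code; the 14x14 grid of 16x16 patches for a 224x224 image is an illustrative assumption.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, seed=0):
    """Keep a random ~25% of patches; the MAE encoder sees only these,
    which is what makes pre-training memory-friendly."""
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])  # visible patch indices
    mask = np.ones(n, dtype=bool)                    # True = masked patch
    mask[keep_idx] = False                           # the decoder reconstructs these
    return patches[keep_idx], mask

# A 224x224 image split into 16x16 patches gives a 14x14 = 196-patch grid
patches = np.random.rand(196, 768)
visible, mask = random_masking(patches)
print(visible.shape)   # (49, 768): only 25% of patches enter the encoder
print(mask.sum())      # 147 patches are left for the decoder to reconstruct
```

Because the encoder processes only the visible quarter of the tokens, its attention cost drops sharply, which is where the memory savings come from.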
A model that combines autoregressive and diffusion models — arxiv.org
[2110.02037] Autoregressive Diffusion Models
Proposes ARDMs (Autoregressive Diffusion Models), a combination of autoregressive and diffusion models. Unlike standard autoregressive models, which generate sequentially from top-left to bottom-right, ARDMs are trained to predict randomly selected positions of the input, which may be similar to BERT's masked language modeling.
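That training scheme can be sketched roughly like this (an illustrative NumPy sketch with my own naming, not the paper's code): sample a random generation order and a random step within it, then predict every position not yet generated.

```python
import numpy as np

def ardm_training_step_mask(seq_len, seed=0):
    """Sample a random generation order and a step t within it.
    Positions before t count as already generated (visible context);
    the model is trained to predict all remaining positions at once,
    much like BERT-style masked prediction."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(seq_len)   # random generation order over positions
    t = rng.integers(0, seq_len)       # how far generation has progressed
    visible = np.zeros(seq_len, dtype=bool)
    visible[order[:t]] = True
    targets = ~visible                 # everything not yet generated
    return visible, targets

visible, targets = ardm_training_step_mask(16)
print(visible.sum(), targets.sum())  # together they cover all 16 positions
```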
— — — — — — — — — — — — — — — — — — –
2. Machine Learning Use Case
Protecting Sexual Minorities with Voice Conversion Using Deep Fake — www.wired.com
Transgender people can be harassed over a mismatch between voice and gender, but using Deep Fake voice conversion can prevent such harassment. This should make it easier for sexual minorities to participate in online communities, which has been difficult for them.
Crossing Language Barriers with Multilingual Model Chatbots — venturebeat.com
This is an introduction to a chatbot developed by Moveworks that uses a multilingual language model. For a global company, this means that employees who speak different languages can receive support without the company having to set up support centers in each country.
— — — — — — — — — — — — — — — — — — –
3. Machine Learning Papers
Evaluating various deep learning models on tabular data — arxiv.org
[2106.11959] Revisiting Deep Learning Models for Tabular Data
A study that benchmarked various deep learning models on tabular data, showing that ResNet-based models are a strong baseline and that FT-Transformer, which tokenizes each feature, also performs well, but neither is significantly superior to GBDT-based methods.
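FT-Transformer's feature tokenization can be sketched as follows (a minimal NumPy sketch; the dimensions and names are illustrative assumptions). Each scalar feature gets its own learned embedding, so the Transformer can treat features the way it treats word tokens.

```python
import numpy as np

def tokenize_features(x, W, b):
    """FT-Transformer-style numeric feature tokenizer (sketch).
    Each scalar x[j] becomes a d-dim token: x[j] * W[j] + b[j]."""
    return x[:, None] * W + b

n_features, d = 10, 32
W = np.random.randn(n_features, d)   # per-feature embedding weights
b = np.random.randn(n_features, d)   # per-feature biases
x = np.random.randn(n_features)      # one tabular row
tokens = tokenize_features(x, W, b)
print(tokens.shape)                  # (10, 32): one token per feature
```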
Tip-Adapter for few-shot learning without training — arxiv.org
[2111.03930] Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling
By improving the CLIP-Adapter, they propose Tip-Adapter, which performs few-shot learning without updating any parameters. It measures the similarity between the test image and the few-shot training images, and predicts the category from this similarity combined with CLIP's text information.
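The training-free prediction can be sketched like this (a NumPy sketch; the function name, toy numbers, and the alpha/beta hyperparameters are illustrative assumptions, and in the real method the features come from CLIP):

```python
import numpy as np

def tip_adapter_logits(test_feat, cache_keys, cache_values, text_logits,
                       alpha=1.0, beta=5.5):
    """Training-free Tip-Adapter sketch.
    test_feat:    (d,)   L2-normalised feature of the test image
    cache_keys:   (N, d) features of the N few-shot training images
    cache_values: (N, C) one-hot labels of those images
    text_logits:  (C,)   CLIP zero-shot logits from class-name prompts
    The cache contributes via similarity only; no parameter is updated."""
    affinity = cache_keys @ test_feat                 # similarity to each shot
    cache_logits = np.exp(-beta * (1 - affinity)) @ cache_values
    return text_logits + alpha * cache_logits

# Toy check: 2 classes, one cached shot each (illustrative numbers)
cache_keys = np.eye(2, 4)                 # (N=2, d=4) cached image features
cache_values = np.eye(2)                  # one-hot labels
test_feat = np.array([1., 0., 0., 0.])    # closest to the class-0 shot
logits = tip_adapter_logits(test_feat, cache_keys, cache_values,
                            text_logits=np.zeros(2))
print(logits.argmax())  # 0
```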
A method that can be directly applied to existing object detection methods to improve accuracy — arxiv.org
[2111.03056] Bootstrap Your Object Detector via Mixed Training
Proposes MixTraining, which replaces existing GT labels in object detection with high-confidence predictions and controls the strength of data augmentation depending on the difficulty of each sample. It can be applied directly to existing object detection methods to improve accuracy.
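The high-confidence filtering behind this kind of pseudo-labeling can be sketched generically as follows (an illustrative sketch; the 0.9 threshold and the names are my assumptions, not the paper's):

```python
def confident_pseudo_labels(pred_labels, pred_scores, score_thresh=0.9):
    """Keep only the detector's high-confidence predictions as training
    targets; low-confidence predictions are discarded as too noisy."""
    return [lab for lab, s in zip(pred_labels, pred_scores)
            if s >= score_thresh]

pseudo = confident_pseudo_labels(["dog", "bird"], [0.95, 0.40])
print(pseudo)  # ['dog']: only the 0.95-confidence prediction survives
```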
Patchification is the cause of training instability in ViT — arxiv.org
[2106.14881] Early Convolutions Help Transformers See Better
ViT trains less stably than CNNs, and the authors argue that the patchification in the initial layer is the reason. By replacing the initial 16x16 patch embedding with a stack of regular 3x3 convolutions and the like, ViT becomes robust to learning-rate fluctuations, converges faster, and outperforms the SotA CNN models.
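To see that a small convolutional stem can match the 16x16 patchify's downsampling, here is a sketch. The exact stem configuration (four stride-2 3x3 convs plus a 1x1) is an illustrative assumption based on the paper's idea, not its verbatim architecture.

```python
def spatial_size(size, layers):
    """Output spatial size after a stack of convs, each given as
    (kernel, stride, padding)."""
    for k, s, p in layers:
        size = (size + 2 * p - k) // s + 1
    return size

patchify = [(16, 16, 0)]                   # ViT's single 16x16, stride-16 conv
conv_stem = [(3, 2, 1)] * 4 + [(1, 1, 0)]  # four stride-2 3x3 convs + a 1x1

# Both reduce a 224x224 image to a 14x14 token grid, but the 3x3 stack
# is claimed to ease optimization.
print(spatial_size(224, patchify))   # 14
print(spatial_size(224, conv_stem))  # 14
```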
AugMax: data augmentation for learning diverse and difficult samples — arxiv.org
[2110.13771] AugMax: Adversarial Composition of Random Augmentations for Robust Training
Proposes AugMax, which searches for stronger data augmentations by adversarially learning the parameters that mix random augmentations. Because such samples are too difficult to learn from directly, DuBIN is also proposed, which separates instance-level and batch-level diversity using Instance Norm and Batch Norm. The authors claim this lets the model learn from diverse, high-difficulty samples.
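The mixing itself can be sketched as follows (a NumPy sketch; in AugMax the mixing weights and strength are optimized adversarially by gradient ascent on the loss, which is omitted here, and the names are my own):

```python
import numpy as np

def augmax_mix(x, augmented_views, w_logits, m_logit):
    """Combine several randomly augmented views of x with learnable weights.
    w_logits -> softmax mixing weights over the views;
    m_logit  -> sigmoid mixing strength m between x and the mixture."""
    w = np.exp(w_logits) / np.exp(w_logits).sum()
    m = 1.0 / (1.0 + np.exp(-m_logit))
    mixed = sum(wi * v for wi, v in zip(w, augmented_views))
    return (1.0 - m) * x + m * mixed

x = np.ones((4, 4))
views = [x * 0.0, x * 1.0]  # stand-ins for two differently augmented copies
out = augmax_mix(x, views, w_logits=np.array([0.0, 0.0]), m_logit=0.0)
print(out[0, 0])  # 0.75: halfway between x and the equal mix of the views
```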
Comparison of Robustness between CNN and Transformer — arxiv.org
[2111.05464] Are Transformers More Robust Than CNNs?
Transformers were said to be more robust than CNNs, but when training conditions such as training data and data augmentations are aligned, CNNs can acquire the same level of robustness against adversarial attacks as Transformers. However, on out-of-distribution data such as ImageNet-A and -C, Transformers were still stronger.
— — — — — — — — — — — — — — — — — — –
4. Technical Articles
What to watch out for in a project using data science techniques — towardsdatascience.com
An article about what to watch out for in projects that use data science techniques. It discusses pitfalls such as unorganized data and conflicts with stakeholders.
— — — — — — — — — — — — — — — — — — –
🌟I post weekly newsletters! Please subscribe!🌟
— — — — — — — — — — — — — — — — — — –
Other Blogs
— — — — — — — — — — — — — — — — — — –
About Me
Manufacturing Engineer/Machine Learning Engineer/Data Scientist / Master of Science in Physics / http://github.com/AkiraTOSEI/
On Twitter, I post one-sentence paper commentaries.