Question

Cannot identify image file when trying to use TensorFlow on Google Colaboratory


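I am using Google Colaboratory and TensorFlow to train a neural network that classifies images of dogs and cats. I train on my data with model.fit_generator. The data loads fine, but after it has read images for a few epochs and starts iterating over the validation steps, I get the error described in the title: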

PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7f347160a0f8>

The cat and dog images I am using were downloaded from Kaggle.

I have seen some solutions that use PIL to handle a single image in a Jupyter notebook, but given that Google Colab uses PIL implicitly, how do I handle this error for every image on Google Colab?

Here is my code:

from google.colab import files
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Activation, Dropout, Flatten, Dense
from keras import backend as K
import numpy as np
from keras.preprocessing import image

from google.colab import drive
drive.mount('/content/drive')

img_width, img_height = 150, 150

train_data_dir = '/content/drive/My Drive/data/train'
validation_data_dir = '/content/drive/My Drive/data/validation'
nb_train_samples = 1000
nb_validation_Samples = 100
epochs = 10
batch_size = 20


if K.image_data_format() == 'channels_first':
    input_shape = (3, img_width, img_height)
else:
    input_shape = (img_width, img_height, 3)

train_datagen = ImageDataGenerator(
    rescale= 1. / 255,
    shear_range = 0.2,
    zoom_range=0.2,
    horizontal_flip=True
)

test_datagen = ImageDataGenerator(rescale=1. / 255)

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    class_mode='binary')

model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=input_shape))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.summary()


model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))


model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))


model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))


model.summary()

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# Build the validation generator before fit_generator, which references it.
validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    class_mode="binary")

model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=nb_validation_Samples // batch_size)

The error itself occurs at this point:

model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=epochs,
    validation_data = validation_generator,
    validation_steps = nb_validation_Samples // batch_size)

Specifically, at this line:

validation_steps = nb_validation_Samples // batch_size)

Recommended answer

If the dataset was downloaded from Microsoft, you can use the script below to clean it. As the comments in the script indicate, it is largely adapted from another SO thread.

#!/usr/bin/env python
# https://stackoverflow.com/questions/63754311/unidentifiedimageerror-cannot-identify-image-file
# 1st in the answers

import os
from PIL import Image

folder_path = r'raw\PetImages'
extensions = []
for fldr in os.listdir(folder_path):
    sub_folder_path = os.path.join(folder_path, fldr)
    for filee in os.listdir(sub_folder_path):
        file_path = os.path.join(sub_folder_path, filee)
        print('** Path: {}  **'.format(file_path), end="\r", flush=True)
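        try:
            # Image.open is lazy; convert() forces a full decode, so corrupt files fail here.
            with Image.open(file_path) as im:
                im.convert('RGB')
            # Record the extension of every file that decoded successfully.
            ext = os.path.splitext(filee)[1].lstrip('.')
            if ext not in extensions:
                extensions.append(ext)
        except Exception:
            print("\nWrong format file:                       ", file_path, flush=True)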

print("\nValid extensions: ", repr(extensions))

'''
** Path: raw\PetImages\Cat\666.jpg  **
Wrong format file:                        raw\PetImages\Cat\666.jpg
** Path: raw\PetImages\Cat\Thumbs.db  **
Wrong format file:                        raw\PetImages\Cat\Thumbs.db
** Path: raw\PetImages\Dog\11702.jpg  **
Wrong format file:                        raw\PetImages\Dog\11702.jpg
** Path: raw\PetImages\Dog\9057.jpg  **D:\penv38\lib\site-packages\PIL\TiffImagePlugin.py:811: UserWarning: Truncated File Read
  warnings.warn(str(msg))
** Path: raw\PetImages\Dog\Thumbs.db  **
Wrong format file:                        raw\PetImages\Dog\Thumbs.db

Valid extensions:  ['jpg']

Thus exclude these files: 
    Cat\666.jpg
    Dog\11702.jpg
    Dog\9057.jpg
'''
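
The script above only reports the bad files. For the Colab setup in the question, one possible follow-up is to move them out of the data directories before calling flow_from_directory. The sketch below is only an illustration under stated assumptions: the names pil_can_decode and quarantine_unreadable_images and the quarantine location are hypothetical and not part of the original answer. It also turns Pillow warnings (such as the "Truncated File Read" shown above) into failures, which is a design choice made so that files like Dog\9057.jpg are excluded too.

```python
import os
import shutil
import warnings


def pil_can_decode(path):
    """Return True if Pillow fully decodes `path` without error or warning."""
    from PIL import Image  # lazy import: only needed when this checker is used
    try:
        with warnings.catch_warnings():
            # Assumption: a decoder warning (e.g. truncated read) means "bad file".
            warnings.simplefilter('error')
            with Image.open(path) as im:
                im.convert('RGB')  # forces a full decode, not just a header read
        return True
    except Exception:
        return False


def quarantine_unreadable_images(data_dir, quarantine_dir, can_decode=pil_can_decode):
    """Move every file under data_dir that fails can_decode into quarantine_dir.

    quarantine_dir must live outside data_dir, otherwise the walk would
    revisit the moved files. Returns the original paths that were moved.
    """
    moved = []
    for root, _dirs, files in os.walk(data_dir):
        for name in sorted(files):
            src = os.path.join(root, name)
            if can_decode(src):
                continue
            # Mirror the class sub-folder (e.g. Cat/, Dog/) to avoid name clashes.
            rel = os.path.relpath(src, data_dir)
            dst = os.path.join(quarantine_dir, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.move(src, dst)
            moved.append(src)
    return moved
```

With the variables from the question, this could be called once per directory before the generators are created, for example quarantine_unreadable_images(validation_data_dir, '/content/quarantine/validation'), where the second path is a made-up location. Moving files instead of deleting them keeps the step reversible if a file was flagged by mistake.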