Error: 'utf-8' Codec Can't Decode Byte 0xb0 In Position 0: Invalid Start Byte In Google Colab

September 27, 2023

import PyPDF4
from google.colab import files

files.upload()
fileReader = PyPDF4.PdfFileReader('ITC-1.pdf')
s = ''
for i in range(2, fileReader.numPages):
    s += fileReader.getPage(i)

Solution 1:

import pdfplumber
import tensorflow as tf

# Extract the text of every PDF page into a plain UTF-8 text file.
with pdfplumber.open('test.pdf') as pdf, open('test.txt', 'w', encoding='utf-8') as f:
    for page in pdf.pages:
        # extract_text() may return None for a page without text
        f.write((page.extract_text() or '') + '\n')

# Build a line-based dataset from the text file, skipping empty lines.
text_ds = tf.data.TextLineDataset('test.txt').filter(
    lambda x: tf.cast(tf.strings.length(x), bool))

# Learn the vocabulary from the extracted text.
layer = tf.keras.layers.TextVectorization()
layer.adapt(text_ds.batch(1024))
inverse_vocab = layer.get_vocabulary()

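You could do something like the code above:

1. Read the PDF using pdfplumber.
2. Write the text of each page to a text file.
3. Create the dataset from that text file.

The error message means that raw bytes which are not valid UTF-8 were decoded as text. A PDF is a binary file, so its text has to be extracted before anything that expects UTF-8 reads it.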
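To see why the intermediate text file helps, here is a minimal sketch that reproduces the error with the standard library only. The file names and the helper `is_valid_utf8` are hypothetical, and the "fake PDF" is just a few raw bytes standing in for a real binary file; it is not a reconstruction of the asker's exact failure.

```python
import os
import tempfile


def is_valid_utf8(path):
    """Return True if the whole file decodes as UTF-8, else False."""
    try:
        with open(path, encoding="utf-8") as fh:
            fh.read()
        return True
    except UnicodeDecodeError:
        return False


# 0xb0 is a UTF-8 continuation byte, so it can never start a character.
try:
    b"\xb0".decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Hypothetical files: raw bytes standing in for a PDF, and extracted text.
tmp_dir = tempfile.mkdtemp()
binary_path = os.path.join(tmp_dir, "fake.pdf")
text_path = os.path.join(tmp_dir, "extracted.txt")

with open(binary_path, "wb") as fh:
    fh.write(b"\xb0\xb1\xb2 raw bytes")
with open(text_path, "w", encoding="utf-8") as fh:
    fh.write("Angle: 90\u00b0\n")  # the degree sign is encoded as two bytes

print(error_message)
print(is_valid_utf8(binary_path), is_valid_utf8(text_path))
```

A check like this can be run on any file before handing it to a text pipeline: if it fails, the file still holds binary data and needs a text extraction step first.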
