MySQL collations - why are there so much problems with them?

Do you share my feeling that any problem with encoding in MySQL database wastes a lot of your time? A few different settings, multiple possible combinations, you tried many of them and it is still not working? Maybe one of them was correct but should I restart MySQL instance, the whole server, the data center? Aaaaargh.

 

 

Weird characters instead of proper national characters

That is a common symptom of wrong collation. You try to add a text like Zwariuję walcząc na pięści z MySQLem via your application but when it is saved, the application shows something different - Zwariuj? walcz?c na pi??ci z MySQLem. Sometimes national characters are replaced with question marks, sometimes with something else but the displayed text is not the same as the one that you typed in.

I have gathered here the most important information that will help you chase the enemy and defeat it.

 

MySQL collation variables

There are three important variables. Their current values can be checked by running:

show variables like '%collation%';

variables

These are the final settings that work for me. Before you change them, read what they actually mean.

collation_database
It defines a collation for the whole database. Every new table that will be created will use this collation. Generally, you should define this setting because it is rather frequent case to add new tables to a database.
collation_server
It defines a collation on the whole database server. It is good to set this variable but it is not so important. It will be used for creating new databases which does not happen often.
collation_connection
This collation will be used for connections between the database server and a client unless the client sets it first. You should set it anyway.

These variables can be set by switching to a particular database and running:

set collation_connection = 'utf8_general_ci';
set collation_database = 'utf8_general_ci';

If you change them on a database that already contains some tables, you should know that each table has it's own collation setting which is inherited from collation_database at the time of the table creation. If during the table creation your database had incorrect setting, you have to update it for each table:

ALTER TABLE `dbname`.`TABLE_NAME`
COLLATE = utf8_general_ci;

MySQL also has other variables and collation settings but I do not consider them so important. I usually do not have to change them so I do not want to introduce additional complexity to the encoding problem in MySQL. It is already enough complex and frustrating.

Reading the documentation you will also notice charset variables. I do not focus on them because a charset value is a part of a collation setting. So if you define collation properly, a charset is also defined.

 

Available collations

A list of available collations can be found in MySQL documentation. Choose the right one that suits best languages that your database and application should support. Let's take a look at some examples.

latin1_swedish_ci - the name consists of three parts:

  • latin1 is a charset,
  • swedish is a local variation of latin1 - it means that the collation supports swedish diactric characters,
  • ci stands for Case Insensitive, it can be cs also which stands for Case Sensitive - it determines how to compare texts.

I use utf8_general_ci which is UTF-8 charset with case insensitive way of comparing texts.

 

Encoding on the client side

It happens that even after configuring MySQL according to the above instruction, some characters that were inserted by a client application are incorrectly displayed in the same client application. In such case, you should try doing the same with MySQL Workbench using INSERT and SELECT statements. If it work correctly with MySQL Workbench but it does not with your application, there is a high probability that your application uses wrong encoding and sets wrong encoding on a connection to the database.

For example, I have an application written in Java so it natively supports Unicode. Everything should be fine but it is not. There is still a chance that the connection uses wrong encoding. I have never had to do this for any other database engine, but with MySQL I have to - set encoding on the datasource. I add useUnicode and characterEndcoding property to the JDBC connection string.

jdbc:mysql://localhost:3306/dbname?useUnicode=yes&characterEncoding=UTF-8

It enforces JDBC driver to use UTF-8 for connections with the database.

 

The above steps and information usually is enough for me to make the system (MySQL + client application) support national characters. Good luck with your case!