This week, while setting up local project for work, I encountered some weired issue during the unit test and this has something to do with postgres and its default settings under Windows and other OS.
As an example, consider the following array:
[ 'D', 'd', 'a', 'A', 'c', 'b', 'CD', 'Capacitor' ]
>>> [ 'D', 'd', 'a', 'A', 'c', 'b', 'CD', 'Capacitor' ].sort()
Sorting this in Postgres SQL with default installation will yield a case insensitive sorting where upper and lower case are mixed:
SELECT regexp_split_to_table('D d a A c b CD Capacitor', ' ') ORDER BY 1;
A has an ASCII code of 32 where as
a has ASCII code of 97, so
For the result that Postgres gave, it is a bit more complex. Postgres uses
LC_COLLATE to determine the sort order of the array.This variable comes from the system and different OS have different implementation, when using locale
POSIX, strings are sorted according to their ASCII value, any other locales will result in a case insensitive result.
[ 'D', 'd', 'a', 'A', 'c', 'b', 'CD', 'Capacitor' ].sort((a,b) => a.localeCompare(b))
This uses localeCompare from string prototype which have some performance implication on larger array.
[ 'D', 'd', 'a', 'A', 'c', 'b', 'CD', 'Capacitor' ].sort(new Intl.Collator('en_us').compare)
This is the recommended by MDN to sort larger arrays.
For Postgres, the first thing that needs to be noted is that Postgres recommends against using locales if it can be avoided, from their documentation:
The drawback of using locales other than C or POSIX in PostgreSQL is its performance impact. It slows character handling and prevents ordinary indexes from being used by LIKE. For this reason use locales only if you actually need them.
The one-off way to fix the sorting is by specifiying the
LC_COLLATE value when creating the database, for example:
CREATE DATABASE db
The created database (db in this case) will use
LC_COLLATE overriding the default
LC_COLLATE value from the OS. With the new database created, you easily verify it will sort in a case sensitive way by the ASCII value once you connect to the database and run the query presented previously.
This one-off way is good enough only if you care about creating such database once. Imaging next time you create a new database, you will still have to manually override the
LC_COLLATE value. So the way to go is to modify the template database, because
LC_COLLATE can’t be changed once the database has been created, we will have to create a new database and set it as template.
Another way to do this, is to initialize the database cluster with C locale as below:
chown -R postgres:postgres /var/lib/postgres/
this will create template database with C locale.
String sorting in PostgreSQL depends on the
LC_COLLATE setting of the table which depends on the setting of the operating system, default sorting will yield results that mixes upper case and lower case, in other words, sorting is not case sensitive. There are many ways to get case sensitive sorting, but the most reliable way should be specifiying the
LC_COLLATE when creating the database.