If you want to read the Traditional Chinese version of this article, click the link "正體中文" in the footer.
The background of this article is a long story...
Eight years ago, I went to CTSC (the IOI China Team Selection Competition) for fun (I was not on the training team, I just went along). During those days, I learned a lot from many experts. The one who impressed me most was BYVoid, because he put a link to his website in his slides. When I opened it, I found that I could switch between "Simplified Chinese" and "Traditional Chinese" at will. It was different from the internationalization I had seen before: the content of the articles changed with the switch too. At that time, I knew nothing about web development, so I was not able to work out how it was implemented.
Later, by chance, I came across a project called OpenCC (coincidentally, its author is BYVoid). It converts Chinese text between the variants used in mainland China, Hong Kong, and Taiwan. It does a good job of converting specialized vocabulary and regional idioms, and it is already a very common library on Linux. Later I learned that BYVoid's website was converted using OpenCC.
At that time, I tried to call OpenCC's functions from PHP, but because of the large workload (and because I was a newbie), I didn't succeed. Later I thought: why not call OpenCC directly in Nginx to convert the page before returning it to the browser? I looked into it. Nginx itself cannot call external functions, but OpenResty (essentially Nginx + Lua) seemed to be able to. However, I was not familiar with Nginx at that time and felt that switching to OpenResty would be a lot of trouble, so I gave up.
One day last year, I migrated my website to a server with better hardware configuration and fully Dockerized it (I also used Docker for local development, just with another configuration file). Switching from Nginx to OpenResty was as easy as changing clothes; coupled with the fact that I felt my level was good enough now, I started another attempt.
After some searching, I found that someone had successfully used OpenResty's Lua FFI to call OpenCC's .so library, but because it was done for paying customers, it was not convenient to open source the code. Although it was a bit regrettable, I thought that since there was a feasible idea, I could definitely make it work.
Build OpenResty+OpenCC in Docker
This should be the easiest part of the whole article to understand. In short, to achieve the goal, we first need an OpenResty and OpenCC. After searching, I found the project hustshawn/openresty-opencc-docker.
It looks like there's nothing wrong with its Dockerfile, just that the version of OpenResty is a bit old, so I forked it, changed the version a bit, and built my own Docker image rexskz/openresty-opencc-docker. To start it up, just write it like this in compose.yml (the reason why the name is still nginx is because it's essentially an Nginx, just like the executable of MariaDB is still called mysql):
services:
  nginx:
    image: rexskz/openresty-opencc-docker
    container_name: nginx
    hostname: nginx
    # .... other configurations omitted

After starting the container and looking around, I found that all the data files related to OpenCC were in the /usr/share/opencc/ directory. Since I wanted to convert Simplified Chinese to Traditional Chinese as used in Taiwan, and needed to convert specialized vocabulary and regional idioms, I needed to load the s2twp.json file.
Understanding Some Basic Knowledge of OpenResty
The reason why OpenResty can do many things is that it has a built-in LuaJIT, which lets you customize the processing logic in the Lua scripting language. For example, Cloudflare uses it for its WAF. Our goal this time is very simple: if we find a cc_lang=zh_TW cookie, we convert the Chinese characters in the response to Traditional Chinese as used in Taiwan.
After searching the OpenResty documentation, I found two directives that could meet my needs: body_filter_by_lua_block and body_filter_by_lua_file (as for body_filter_by_lua, the documentation says it is no longer recommended). The former is to write Lua code directly into the Nginx configuration file, and the latter is to write it into a specific Lua file. I chose the latter because it would be more convenient in the development environment: you can add the lua_code_cache off directive in the Nginx configuration file, and you don't need to nginx -s reload every time you modify the Lua file.
Understanding OpenResty Lua Files
The reason why OpenResty can call Lua to modify requests and responses is that in LuaJIT, you can use a global variable called ngx to get Nginx data or interact with Nginx. For example, in a Lua script loaded by body_filter_by_lua_file:
- You can use ngx.arg[1] to get the Nginx response body;
- You can use ngx.var['cookie_cc_lang'] to get the value of the cookie named cc_lang;
- You can use ngx.log(ngx.NOTICE, "xxx") to write a message to the Nginx log at the "notice" level.
The complete ngx API is here: Lua Ngx API - OpenResty Reference.
As for Lua's FFI, after reading the documentation, I found it was quite simple, just a few steps:
- require('ffi') (of course);
- Define the function signature, using C syntax;
- Load the dynamic link library with a command;
- Call it directly, like a function in JavaScript.
To achieve the goal of this article, it is enough to read only one page: FFI · OpenResty最佳实践.
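The four steps above can be sketched with a function from the C standard library, so that nothing beyond LuaJIT itself is needed. This is only a minimal illustration: strlen stands in for a real library function, and ffi.C (LuaJIT's default namespace) stands in for a library loaded by name.

```lua
-- Step 1: load the FFI library (part of LuaJIT, which OpenResty embeds)
local ffi = require('ffi')

-- Step 2: declare the function signature using C syntax
ffi.cdef[[
size_t strlen(const char *s);
]]

-- Step 3: pick the library namespace. ffi.C is the default namespace;
-- a separate shared library would be loaded with ffi.load('name') instead.
local C = ffi.C

-- Step 4: call the C function like an ordinary Lua function.
-- size_t may come back as a 64-bit cdata value, so convert it to a Lua number.
local n = tonumber(C.strlen('hello'))
print(n)
```

For a library such as OpenCC, only step 3 changes: the functions are called on the namespace returned by ffi.load.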
Understanding OpenCC's C API
All C APIs that can be called through FFI are here: Open Chinese Convert: OpenCC C API. There are only a few functions, and we can get the calling sequence from the description:
opencc_t inst = opencc_open("/usr/share/opencc/s2twp.json");
// Using UTF-8 encoding
const char *input = "这是一段文字";
char *output = opencc_convert_utf8(inst, input, strlen(input));
// Of course, we can also use this idea: that is, allocate space for output in advance.
// Caution: the converted text can be longer than the input, so the buffer
// has to be sized for the output rather than for the input.
// char *output = (char *)malloc(sizeof(char) * strlen(input) + 1);
// opencc_convert_utf8_to_buffer(inst, input, strlen(input), output);
// Find a way to save output
// ....
// Finally, some cleaning is needed, otherwise there may be memory leaks
opencc_convert_utf8_free(output);
opencc_close(inst);

Start the Formal Practice
After sorting out the ideas, we can start coding! First, create a file named nginx-opencc-filter.lua and mount it to the OpenResty container:
services:
  nginx:
    volumes:
      # This is the configuration file of my website
      - ./nginx-docker-live.conf:/path/to/rexskz.conf
      # This is the newly created .lua file
      - ./nginx-opencc-filter.lua:/path/to/opencc-filter.lua
    # .... other configurations omitted

Then load this Lua file in the OpenResty configuration file:
# Add this configuration only to the pages you need to convert, not all locations
location /your-page {
    body_filter_by_lua_file /path/to/opencc-filter.lua;
    # .... other configurations omitted
}

Next, let's start writing the Lua script! There are many online resources on Lua's simple syntax, so I won't go into it here. The whole script is shown below. The basic idea is similar to the C code we just wrote, and it should be very easy to understand:
local cookie_value = ngx.var['cookie_cc_lang']
if cookie_value == 'zh_TW' then
    local ffi = require('ffi')
    -- Declare the OpenCC C functions we are going to call
    ffi.cdef[[
        typedef void* opencc_t;
        opencc_t opencc_open(const char *configFileName);
        int opencc_close(opencc_t opencc);
        char* opencc_convert_utf8(opencc_t opencc, const char *input, size_t length);
        void opencc_convert_utf8_free(char *str);
    ]]
    -- Load the OpenCC shared library
    local cc = ffi.load('opencc')
    local inst = cc.opencc_open('/usr/share/opencc/s2twp.json')
    local input = ngx.arg[1]
    local res = cc.opencc_convert_utf8(inst, input, #input)
    -- Copy the converted C string back into a Lua string
    ngx.arg[1] = ffi.string(res)
    -- Release the memory allocated by OpenCC
    cc.opencc_convert_utf8_free(res)
    cc.opencc_close(inst)
end

After adding the cookie and refreshing, I found that the page was already in Traditional Chinese! The next step is to add links to Simplified Chinese and Traditional Chinese in the appropriate places on the page.
Bug Caused by Buffer
Strangely, not all pages were converted successfully. Some pages loaded only a small part of the text, and some even failed outright with ERR_INCOMPLETE_CHUNKED_ENCODING. Going back to the Nginx log, I found that whenever I visited such a page, there was an extra line: worker process 8 exited on signal 11. I looked at it for a long time and felt that the Lua script itself should not have a bug, but Lua scripts in OpenResty are not convenient to step through in a debugger. So I fell back on the debugging method I was proficient in eight years ago: debugging by printing. I started by logging ngx.arg[1], saved, refreshed, and found a bunch of garbled characters in the UTF-8 output... Yes, the kind with a question mark inside a black diamond.
For a normal request, the filter logs one or more times, and none of the outputs contains garbled characters. For a failing request, the last output always contains them, and then the worker process crashes. Experience told me that this must be because Nginx, for speed, processes data in a streaming manner: it splits the response into several chunks and passes them to the Lua script one at a time. If a split happens to fall in the middle of a Chinese character (which occupies 3 bytes in UTF-8), the character is torn into two invalid byte sequences that end up in two different chunks.
OpenCC has no reason to handle this special case, and even if it did, it could not convert this character (or even the word that contains it). Therefore, what we can do is prevent Nginx from splitting the response body of the page. Since my website runs on FastCGI + PHP-FPM (an ancient technology), I only needed to increase the buffer sizes. (I tried setting fastcgi_buffering off directly, but it seemed to have no effect, and I don't know why.)
fastcgi_buffers 8 256k;
fastcgi_buffer_size 256k;
fastcgi_busy_buffers_size 512k;

As for the case of using proxy_pass, there are equivalent proxy_buffer_size and other directives, which are used in exactly the same way.
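The split-character problem itself can be illustrated in plain Lua. The sketch below is not what I did on my site (I only enlarged the buffers). It is a hypothetical alternative, assuming the filter would hold back the incomplete bytes at the end of a chunk and prepend them to the next one. The helper names incomplete_tail_len and split_complete are made up for this example.

```lua
-- Returns how many bytes at the end of `chunk` belong to an incomplete
-- UTF-8 sequence (0 if the chunk ends on a character boundary).
local function incomplete_tail_len(chunk)
  local len = #chunk
  -- An incomplete sequence is at most 3 bytes long, so a short look-back is enough.
  for back = 1, math.min(4, len) do
    local b = chunk:byte(len - back + 1)
    if b < 0x80 then
      return 0 -- ASCII byte: nothing is cut off
    elseif b >= 0xC0 then
      -- Lead byte: it announces how long the whole sequence must be.
      local need = (b >= 0xF0 and 4) or (b >= 0xE0 and 3) or 2
      if back < need then
        return back -- fewer bytes present than announced
      end
      return 0
    end
    -- Otherwise this is a continuation byte (0x80..0xBF): keep looking back.
  end
  return 0
end

-- Splits `chunk` into a part that is safe to convert and a tail to keep for later.
local function split_complete(chunk)
  local tail = incomplete_tail_len(chunk)
  return chunk:sub(1, #chunk - tail), chunk:sub(#chunk - tail + 1)
end

-- "中" is the three bytes 228 184 173; here the chunk ends after the first two.
local head, tail = split_complete("ab\228\184")
print(#head, #tail)
```

Inside a body filter, the held-back tail could be stored in ngx.ctx between calls and flushed when ngx.arg[2] (the end-of-response flag) is true. This only protects single characters: a word that spans two chunks would still be converted character by character, so larger buffers remain the simpler fix.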
So... my website now supports Traditional Chinese.