CRAC通过快照的方式,可以快速的启动应用程序,在springboot3.2之后的版本中也提供了相关的支持,基于这个背景,对CRAC做一定的尝试,看看是否能在启动速度这块有更好的体验。主要参考spring-boot-crac-demo
中的示例。
JDK 版本
CRAC需要特定的jdk版本支持才可以,最初使用的是openjdk17的版本做测试,过程中还出现了比较多的的问题。这里建议使用:https://cdn.azul.com/zulu/bin/zulu17.44.55-ca-crac-jdk17.0.8.1-linux_x64.tar.gz
– crac 底层依赖了criu,很多原理性的内容,可以参考 https://criu.org/Main_Page
代码执行示例
理论上使用springboot-3.2.2版本就可以支持crac了,但需要在classpath下添加crac的依赖
1 2 3 4 5 6
| <dependency> <groupId>org.crac</groupId> <artifactId>crac</artifactId> </dependency>
|
启动的国产中自动做checkpoint
1 2 3
| $JAVA_HOME/bin/java -Dspring.context.checkpoint=onRefresh -Dmanagement.endpoint.health.probes.add-additional-paths="true" -Dmanagement.health.probes.enabled="true" -XX:CRaCCheckpointTo=/tmp/cr -jar ./target/spring-boot-crac-demo-1.0.0-SNAPSHOT.jar
|
如果在使用过程中存在异常,可以添加额外的参数,让异常信息更丰富 -Djdk.crac.collect-fd-stacktraces=true
从快照中恢复
1 2 3
| $JAVA_HOME/bin/java -Dmanagement.endpoint.health.probes.add-additional-paths="true" -Dmanagement.health.probes.enabled="true" -XX:CRaCRestoreFrom=/tmp/cr
|
恢复过程中的异常
如果恢复过程中,程序使用的pid已经被占了,会导致恢复不出来,此时可以通过unshare -p -m –fork –mount-proc 来启动恢复的程序。详细参考: https://criu.org/When_C/R_fails
1 2 3
| sudo unshare -p -m --fork --mount-proc $JAVA_HOME/bin/java -Dmanagement.endpoint.health.probes.add-additional-paths="true" -Dmanagement.health.probes.enabled="true" -XX:CRaCRestoreFrom=/tmp/cr
|
checkpoint和恢复的示例
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
| johnzb@DESKTOP-2RFOQOL:~/code/spring-boot-crac-demo$ $JAVA_HOME/bin/java -Dspring.context.checkpoint=onRefresh -Dmanagement.endpoint.health.probes.add-additional-paths="true" -Dmanagement.health.probes.enabled="true" -XX:CRaCCheckpointTo=/tmp/cr -jar ./target/spring-boot-crac-demo-1.0.0-SNAPSHOT.jar ----- the begin ------
. ____ _ __ _ _ /\\ / ___'_ __ _ _(_)_ __ __ _ \ \ \ \ ( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \ \\/ ___)| |_)| | | | | || (_| | ) ) ) ) ' |____| .__|_| |_|_| |_\__, | / / / / =========|_|==============|___/=/_/_/_/ :: Spring Boot :: (v3.2.2)
2024-05-25T23:22:09.070+08:00 INFO 862 --- [ main] com.example.Application : Starting Application v1.0.0-SNAPSHOT using Java 17.0.8.1 with PID 862 (/home/johnzb/code/spring-boot-crac-demo/target/spring-boot-crac-demo-1.0.0-SNAPSHOT.jar started by johnzb in /home/johnzb/code/spring-boot-crac-demo) 2024-05-25T23:22:09.074+08:00 INFO 862 --- [ main] com.example.Application : No active profile set, falling back to 1 default profile: "default" 2024-05-25T23:22:10.153+08:00 INFO 862 --- [ main] o.s.b.w.embedded.tomcat.TomcatWebServer : Tomcat initialized with port 8080 (http) 2024-05-25T23:22:10.165+08:00 INFO 862 --- [ main] o.apache.catalina.core.StandardService : Starting service [Tomcat] 2024-05-25T23:22:10.165+08:00 INFO 862 --- [ main] o.apache.catalina.core.StandardEngine : Starting Servlet engine: [Apache Tomcat/10.1.18] 2024-05-25T23:22:10.197+08:00 INFO 862 --- [ main] o.a.c.c.C.[Tomcat].[localhost].[/] : Initializing Spring embedded WebApplicationContext 2024-05-25T23:22:10.199+08:00 INFO 862 --- [ main] w.s.c.ServletWebServerApplicationContext : Root WebApplicationContext: initialization completed in 1038 ms 2024-05-25T23:22:10.652+08:00 INFO 862 --- [ main] o.s.b.a.e.web.EndpointLinksResolver : Exposing 1 endpoint(s) beneath base path '/actuator' 2024-05-25T23:22:10.682+08:00 INFO 862 --- [ main] o.s.c.support.DefaultLifecycleProcessor : Triggering JVM checkpoint/restore 2024-05-25T23:22:10.685+08:00 INFO 862 --- [ main] jdk.crac : /home/johnzb/code/spring-boot-crac-demo/target/spring-boot-crac-demo-1.0.0-SNAPSHOT.jar is recorded as always available on restore CR: Checkpoint ... Killed johnzb@DESKTOP-2RFOQOL:~/code/spring-boot-crac-demo$ $JAVA_HOME/bin/java -Dmanagement.endpoint.health.probes.add-additional-paths="true" -Dmanagement.health.probes.enabled="true" -XX:CRaCRestoreFrom=/tmp/cr Error (criu/cr-restore.c:1518): Can't fork for 862: Operation not permitted Error (criu/cr-restore.c:2605): Restoring FAILED. johnzb@DESKTOP-2RFOQOL:~/code/spring-boot-crac-demo$ sudo $JAVA_HOME/bin/java -Dmanagement.endpoint.health.probes.add-additional-paths="true" -Dmanagement.health.probes.enabled="true" -XX:CRaCRestoreFrom=/tmp/cr [sudo] password for johnzb: shm_open: Permission denied 2024-05-25T23:22:24.331+08:00 INFO 862 --- [ main] o.s.c.support.DefaultLifecycleProcessor : Restarting Spring-managed lifecycle beans after JVM restore 2024-05-25T23:22:24.390+08:00 INFO 862 --- [ main] o.s.b.w.embedded.tomcat.TomcatWebServer : Tomcat started on port 8080 (http) with context path '' 2024-05-25T23:22:24.409+08:00 INFO 862 --- [ main] com.example.Application : Restored Application in 0.096 seconds (process running for 0.096) ^Cjohnzb@DESKTOP-2RFOQOL:~/code/spring-boot-crac-demo$ ^C
|
快照的内容
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
| johnzb@DESKTOP-2RFOQOL:~/code/spring-boot-crac-demo$ ll /tmp/cr/ total 263464 drwx------ 2 johnzb johnzb 12288 May 25 22:49 ./ drwxrwxrwt 20 root root 4096 May 25 23:02 ../ -rw-r--r-- 1 johnzb johnzb 2156 May 25 22:42 core-151378.img -rw-r--r-- 1 johnzb johnzb 7 May 25 22:49 cppath -rw------- 1 johnzb johnzb 508299 May 25 22:49 dump4.log -rw-r--r-- 1 johnzb johnzb 116 May 25 22:49 fdinfo-2.img -rw-r--r-- 1 johnzb johnzb 4284 May 25 22:49 files.img -rw-r--r-- 1 johnzb johnzb 18 May 25 22:49 fs-155623.img -rw-r--r-- 1 johnzb johnzb 36 May 25 22:49 ids-155623.img -rw-r--r-- 1 johnzb johnzb 46 May 25 22:49 inventory.img -rw-r--r-- 1 johnzb johnzb 12383 May 25 22:42 mm-151378.img -rw-r--r-- 1 johnzb johnzb 12319 May 25 22:49 mm-155623.img -rw-r--r-- 1 johnzb johnzb 9626 May 25 22:42 pagemap-151378.img -rw-r--r-- 1 johnzb johnzb 9792 May 25 22:43 pagemap-151815.img -rw-r--r-- 1 johnzb johnzb 229 May 25 22:49 pstree.img -rw-r--r-- 1 johnzb johnzb 12 May 25 22:49 seccomp.img -rw-r--r-- 1 johnzb johnzb 54 May 25 22:49 stats-dump -rw-r--r-- 1 johnzb johnzb 36 May 25 22:49 timens-0.img -rw-r--r-- 1 johnzb johnzb 207 May 25 22:49 tty-info.img
|
Crac 面临的问题
由于crac是通过对运行中的程序做checkpoint,随后在新的机器上做restore,快速启动程序,这里就面对几个需要解决的问题
- checkpoint做的时机选择,如果checkpoint选择的比较早,那么restore后需要做的事情还会很多,checkpoint、restore没有意义。如果选择的点比较晚的话,程序中会有很多状态信息,会导致restore的时候,需要处理非常多的异常情况。
- 由于crac是对进程信息做快照,实际一个程序中会有非常多的静态final变量,以及和机器设备有关系的全局变量,比如ip信息、时间信息等内容。 这块多且复杂,目前没有一个比较简单快速的方式处理。这是一个必须要面对的问题。
参考资料:
- https://docs.spring.io/spring-framework/reference/integration/checkpoint-restore.html
- https://github.com/sdeleuze/spring-boot-crac-demo
- https://openjdk.org/projects/leyden/notes/02-shift-and-constrain