./scripts/train_nsvf_lego.sh: line 11: 18824 Segmentation fault #76

Open
guwenxiang1 opened this issue Jul 6, 2023 · 1 comment

Comments


guwenxiang1 commented Jul 6, 2023

Thanks for your great work. I encountered the following error when running the ./scripts/train_nsvf_lego.sh script.
My GPU is an RTX 3090 and the system is Ubuntu 18.04.6 LTS.
The error messages are shown in the "details":
[Taichi] version 1.7.0, llvm 15.0.4, commit a992f22e, linux, python 3.9.16
[Taichi] Starting on arch=cuda
Loading 100 train images ...
100it [00:02, 38.78it/s]
Loading 200 test images ...
200it [00:05, 39.27it/s]
Hash Encoder: base_res=16 max_res=1024 hash_level=16 feat_per_level=2 per_level_scale=0.2772588722239781 total_hash_size=5710032
Failed to import apex FusedAdam, use torch Adam instead.
[W 07/06/23 20:09:27.367 19027] [type_check.cpp:type_check_store@37] [$13858] Local store may lose precision: f16 <- f32

[W 07/06/23 20:09:27.367 19027] [type_check.cpp:type_check_store@37] [$13883] Local store may lose precision: f16 <- f32

[W 07/06/23 20:09:27.367 19027] [type_check.cpp:type_check_store@37] [$13908] Local store may lose precision: f16 <- f32

elapsed_time=2.19s | step=0 | psnr=10.84 | loss=0.082505 | rays=8192 | rm_s=246.9 | vr_s=246.9 |
elapsed_time=18.99s | step=1000 | psnr=28.49 | loss=0.001417 | rays=8192 | rm_s=26.7 | vr_s=14.3 |
elapsed_time=31.39s | step=2000 | psnr=31.23 | loss=0.000753 | rays=8192 | rm_s=24.6 | vr_s=9.8 |
elapsed_time=43.47s | step=3000 | psnr=31.70 | loss=0.000675 | rays=8192 | rm_s=24.9 | vr_s=9.0 |
elapsed_time=55.69s | step=4000 | psnr=32.43 | loss=0.000572 | rays=8192 | rm_s=24.3 | vr_s=8.3 |
elapsed_time=68.28s | step=5000 | psnr=33.79 | loss=0.000418 | rays=8192 | rm_s=22.9 | vr_s=8.1 |
elapsed_time=81.02s | step=6000 | psnr=33.98 | loss=0.000400 | rays=8192 | rm_s=23.7 | vr_s=7.1 |
elapsed_time=93.21s | step=7000 | psnr=34.45 | loss=0.000359 | rays=8192 | rm_s=23.5 | vr_s=7.0 |
elapsed_time=105.40s | step=8000 | psnr=35.15 | loss=0.000305 | rays=8192 | rm_s=23.7 | vr_s=7.0 |
elapsed_time=118.02s | step=9000 | psnr=35.66 | loss=0.000272 | rays=8192 | rm_s=23.4 | vr_s=6.7 |
elapsed_time=130.00s | step=10000 | psnr=35.29 | loss=0.000296 | rays=8192 | rm_s=23.0 | vr_s=6.6 |
elapsed_time=142.20s | step=11000 | psnr=34.47 | loss=0.000357 | rays=8192 | rm_s=23.4 | vr_s=6.6 |
elapsed_time=154.58s | step=12000 | psnr=35.71 | loss=0.000269 | rays=8192 | rm_s=24.0 | vr_s=6.5 |
elapsed_time=166.82s | step=13000 | psnr=36.06 | loss=0.000248 | rays=8192 | rm_s=23.7 | vr_s=6.6 |
elapsed_time=179.23s | step=14000 | psnr=36.39 | loss=0.000230 | rays=8192 | rm_s=22.9 | vr_s=6.3 |
elapsed_time=191.38s | step=15000 | psnr=36.37 | loss=0.000231 | rays=8192 | rm_s=23.4 | vr_s=6.4 |
elapsed_time=203.60s | step=16000 | psnr=36.54 | loss=0.000222 | rays=8192 | rm_s=23.4 | vr_s=6.6 |
elapsed_time=216.32s | step=17000 | psnr=37.19 | loss=0.000191 | rays=8192 | rm_s=23.1 | vr_s=6.5 |
elapsed_time=229.25s | step=18000 | psnr=37.12 | loss=0.000194 | rays=8192 | rm_s=22.7 | vr_s=6.1 |
elapsed_time=241.81s | step=19000 | psnr=37.59 | loss=0.000174 | rays=8192 | rm_s=22.7 | vr_s=6.4 |
elapsed_time=253.84s | step=20000 | psnr=37.10 | loss=0.000195 | rays=8192 | rm_s=22.5 | vr_s=6.1 |
evaluating: 0%| | 0/200 [00:00<?, ?it/s][W 07/06/23 20:13:40.575 18824] [type_check.cpp:type_check_store@37] [$28398] Global store may lose precision: u8 <- i32
File "/data2/gwx/taichi-nerfs/modules/ray_march.py", line 254, in raymarching_test_kernel:
valid_mask[idx] = 1
^^^^^^^^^^^^^^^^^^^

evaluating: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:10<00:00, 19.48it/s]
evaluation: psnr_avg=34.72092056274414 | ssim_avg=0.9757876992225647
[Taichi] Starting on arch=cuda
Loading 100 train images ...
100it [00:02, 39.17it/s]
Hash Encoder: base_res=16 max_res=1024 hash_level=16 feat_per_level=2 per_level_scale=0.2772588722239781 total_hash_size=5710032
loading ckpt from: results/model.pth
./scripts/train_nsvf_lego.sh: line 11: 18824 Segmentation fault (core dumped) python3 train.py --root_dir $DATA_DIR/Lego --exp_name Lego --batch_size 8192 --lr 1e-2 --gui
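
In the log above, training and evaluation both finish, and the crash comes only after the checkpoint is reloaded for the `--gui` run. One way to narrow this down is Python's standard-library `faulthandler`, which dumps the Python traceback when the process receives a segfault. This is only a sketch: the helper name `run_with_faulthandler` is hypothetical, and nothing here is taken from the repository's own code.

```python
import subprocess
import sys


def run_with_faulthandler(code: str) -> subprocess.CompletedProcess:
    """Run a Python snippet in a child process with faulthandler enabled.

    Hypothetical helper for illustration; it is not part of taichi-nerfs.
    """
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # Confirm that the -X faulthandler option is active in the child process.
    result = run_with_faulthandler(
        "import faulthandler; print(faulthandler.is_enabled())"
    )
    print(result.stdout.strip())
```

The same option can be passed on the failing command itself, for example `python3 -X faulthandler train.py --root_dir $DATA_DIR/Lego --exp_name Lego --batch_size 8192 --lr 1e-2 --gui`, so that the last Python frame before the crash is printed to stderr.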


jexiaong commented Jul 6, 2023

I'm also running into a similar issue; however, I segfault earlier than you did.
I used an NVIDIA GeForce RTX 3080 Ti and WSL Ubuntu 18.04.6 LTS.

[Taichi] version 1.7.0, llvm 15.0.4, commit a992f22e, linux, python 3.8.17
[W 07/06/23 14:41:38.472 29689] [cuda_driver.cpp:load_lib@36] libcuda.so lib not found.
[W 07/06/23 14:41:38.473 29689] [misc.py:adaptive_arch_select@747] Arch=[<Arch.cuda: 3>] is not supported, falling back to CPU
[Taichi] Starting on arch=x64
Loading 100 train images ...
100it [00:02, 48.16it/s]
Loading 200 test images ...
200it [00:04, 43.80it/s]
Hash Encoder: base_res=16 max_res=1024 hash_level=16 feat_per_level=2 per_level_scale=0.2772588722239781 total_hash_size=5710032
Failed to import apex FusedAdam, use torch Adam instead.
[W 07/06/23 14:41:49.776 29751] [type_check.cpp:type_check_store@37] [$14165] Local store may lose precision: f16 <- f32

[W 07/06/23 14:41:49.776 29751] [type_check.cpp:type_check_store@37] [$14190] Local store may lose precision: f16 <- f32

[W 07/06/23 14:41:49.776 29751] [type_check.cpp:type_check_store@37] [$14215] Local store may lose precision: f16 <- f32

./scripts/train_nsvf_lego.sh: line 11: 29689 Segmentation fault python3 train.py --root_dir $DATA_DIR/Lego --exp_name Lego --batch_size 8192 --lr 1e-2 --gui
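
The second log reports `libcuda.so lib not found` and then falls back to `arch=x64`, so this run never reached the GPU. A minimal check, assuming only the standard library, is to try loading the CUDA driver library with `ctypes` before launching training. The function name `find_cuda_driver` and the list of candidate library names are assumptions for illustration. Whether the CPU fallback is what triggers this segfault is not established by the log.

```python
import ctypes


def find_cuda_driver(candidates=("libcuda.so", "libcuda.so.1")):
    """Return the first CUDA driver library name that loads, or None.

    Hypothetical helper; the candidate names are assumptions.
    """
    for name in candidates:
        try:
            ctypes.CDLL(name)  # raises OSError if the loader cannot find it
            return name
        except OSError:
            continue
    return None


if __name__ == "__main__":
    found = find_cuda_driver()
    if found is None:
        print("CUDA driver library not found on the loader path")
    else:
        print(f"CUDA driver library loaded: {found}")
```

If this reports that the library is not found under WSL, it may be worth checking whether the directory that holds the driver library (typically `/usr/lib/wsl/lib` on WSL 2) is on `LD_LIBRARY_PATH`.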
