Nathan DeBardeleben
Title
Cited by
Year
Addressing failures in exascale computing
M Snir, RW Wisniewski, JA Abraham, SV Adve, S Bagchi, P Balaji, J Belak, ...
The International Journal of High Performance Computing Applications 28 (2 …, 2014
Cited by: 321, Year: 2014
Memory errors in modern systems: The good, the bad, and the ugly
V Sridharan, N DeBardeleben, S Blanchard, KB Ferreira, J Stearley, ...
ACM SIGARCH Computer Architecture News 43 (1), 297-310, 2015
Cited by: 211, Year: 2015
Feng shui of supercomputer memory: positional effects in DRAM and SRAM faults
V Sridharan, J Stearley, N DeBardeleben, S Blanchard, S Gurumurthi
SC'13: Proceedings of the International Conference on High Performance …, 2013
Cited by: 164, Year: 2013
Understanding GPU errors on large-scale HPC systems and the implications for system design and operation
D Tiwari, S Gupta, J Rogers, D Maxwell, P Rech, S Vazhkudai, D Oliveira, ...
2015 IEEE 21st International Symposium on High Performance Computer …, 2015
Cited by: 99, Year: 2015
High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development
N DeBardeleben, J Laros, JT Daly, SL Scott, C Engelmann, B Harrod
Whitepaper, Dec, 2009
Cited by: 71, Year: 2009
F-SEFI: A Fine-Grained Soft Error Fault Injection Tool for Profiling Application Vulnerability
Q Guan, N DeBardeleben, S Blanchard, S Fu
Proceedings of the 2014 IEEE 28th International Parallel and Distributed …, 2014
Cited by: 57, Year: 2014
GPGPUs: How to Combine High Computational Power with High Reliability
LB Gomez, F Cappello, L Carro, N DeBardeleben, B Fang, S Gurumurthi, ...
Cited by: 56, Year: 2014
Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters
WM Jones, JT Daly, N DeBardeleben
Proceedings of the 19th ACM International Symposium on High Performance …, 2010
Cited by: 43, Year: 2010
On the diversity of cluster workloads and its impact on research results
G Amvrosiadis, JW Park, GR Ganger, GA Gibson, E Baseman, ...
2018 USENIX Annual Technical Conference (USENIX ATC 18), 533-546, 2018
Cited by: 42, Year: 2018
Application monitoring and checkpointing in HPC: looking towards exascale systems
WM Jones, JT Daly, N DeBardeleben
Proceedings of the 50th Annual Southeast Regional Conference, 262-267, 2012
Cited by: 33, Year: 2012
Inter-agency workshop on HPC resilience at extreme scale
J Daly, B Harrod, T Hoang, L Nowell, B Adolf, S Borkar, N DeBardeleben, ...
National Security Agency Advanced Computing Systems, 2012
Cited by: 33, Year: 2012
Developing scientific applications using Eclipse
GR Watson, NA DeBardeleben
Computing in Science & Engineering 8 (4), 50-61, 2006
Cited by: 33, Year: 2006
Experimental framework for injecting logic errors in a virtual machine to profile applications for soft error resilience
N DeBardeleben, S Blanchard, Q Guan, Z Zhang, S Fu
European Conference on Parallel Processing, 282-291, 2011
Cited by: 29, Year: 2011
GPU behavior on a large HPC cluster
N DeBardeleben, S Blanchard, L Monroe, P Romero, D Grunau, C Idler, ...
European Conference on Parallel Processing, 680-689, 2013
Cited by: 25, Year: 2013
Towards practical algorithm based fault tolerance in dense linear algebra
P Wu, Q Guan, N DeBardeleben, S Blanchard, D Tao, X Liang, J Chen, ...
Proceedings of the 25th ACM International Symposium on High-Performance …, 2016
Cited by: 22, Year: 2016
Experimental and analytical study of Xeon Phi reliability
D Oliveira, L Pilla, N DeBardeleben, S Blanchard, H Quinn, I Koren, ...
Proceedings of the International Conference for High Performance Computing …, 2017
Cited by: 20, Year: 2017
Silent data corruption resilient two-sided matrix factorizations
P Wu, N DeBardeleben, Q Guan, S Blanchard, J Chen, D Tao, X Liang, ...
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of …, 2017
Cited by: 17, Year: 2017
Open|SpeedShop: open source performance analysis for Linux clusters
M Schulz, S Cranford, N DeBardeleben, JE Galarowicz, D Maghrak
Proceedings of the 2006 ACM/IEEE conference on Supercomputing, 211-es, 2006
Cited by: 17, Year: 2006
Exploring time and frequency domains for accurate and automated anomaly detection in cloud computing systems
Q Guan, S Fu, N DeBardeleben, S Blanchard
2013 IEEE 19th Pacific Rim International Symposium on Dependable Computing …, 2013
Cited by: 16, Year: 2013
Interpretable anomaly detection for monitoring of high performance computing systems
E Baseman, S Blanchard, N DeBardeleben, A Bonnie, A Morrow
22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining-Outlier …, 2016
Cited by: 15, Year: 2016