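Data security is a major concern for anyone doing data analysis. The key data we analyze and the key scripts we rely on should be backed up regularly.
The simplest way to back up is to use cp (local disk) or scp (remote disk) to make a copy of your result files, and to copy again whenever they change. The commands are: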
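    cp -fur source_project project_bak
    scp -r source_project user@remote_server_ip:project_bak
To back up on a schedule, we can put these commands into crontab and have them run every night at 23:00. For backups to a remote server, we can set up passwordless SSH login so that the backup can run unattended.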
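One detail worth knowing before scheduling the cp command: with GNU cp, if project_bak already exists, `cp -r source_project project_bak` places the copy at project_bak/source_project instead of updating project_bak in place. The sketch below is one possible wrapper that avoids this by copying `source_project/.`; the function name `backup_dir`, the log file, and the temporary directories are assumptions made for illustration, not part of the original commands.

```shell
#!/bin/bash
# Sketch of a backup wrapper that cron could call; all paths are placeholders.
set -euo pipefail

# backup_dir SRC DST LOG: copy new or updated files from SRC into DST.
backup_dir() {
    local src=$1 dst=$2 log=$3
    mkdir -p "$dst"
    # "$src/." copies the contents of SRC, so repeated runs update DST in
    # place instead of creating DST/<name of SRC> on the second run.
    cp -fur "$src/." "$dst/"
    echo "$(date '+%F %T') backed up $src to $dst" >> "$log"
}

# Demo on a throwaway directory.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir "$work/source_project"
echo "v1" > "$work/source_project/result.txt"
backup_dir "$work/source_project" "$work/project_bak" "$work/backup.log"

sleep 1   # make sure the next edit gets a strictly newer modification time
echo "v2" > "$work/source_project/result.txt"
backup_dir "$work/source_project" "$work/project_bak" "$work/backup.log"

cat "$work/project_bak/result.txt"   # prints "v2"
```

Because of `-u`, the second run only recopies files whose modification time is newer than the copy in the backup, which is why the demo waits a second before editing the file.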
    # Crontab format
    # Minute Hour Day Month Week command
    # * means every minute/hour/day/month/week
    # Run cp every day at 23:00
    0 23 * * * cp -fur source_project project_bak
    # */2 means every 2 minutes/hours/days/months/weeks
    # Run cp every 24 hours (this fires at 00:00)
    0 */24 * * * cp -fur source_project project_bak
    0 0 */1 * * scp -r source_project user@remote_server_ip:project_bak
    # crontab also supports a special time string
    # @reboot: run the given command at boot
    @reboot cmd
rsync
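cp and scp are simple to use, but they always work on whole files, and scp -r recopies everything on every run (cp -u at least skips files that are not newer). When there is a lot to copy, the repeated copying costs both time and disk wear.
rsync is an incremental backup tool: it synchronizes only the changed parts of changed files, which greatly reduces the amount of data transferred and the transfer time. Usage is as follows: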
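    # Back up everything under the local project directory to /backup/project on the remote server.
    # Note the trailing slash after the first project: it means "copy the contents of the directory"
    # rather than creating a project folder inside the destination.
    # Compare with the second command; both achieve the same result.
    # -a: archive mode, equals -rlptgoD
    # -r: recurse into directories
    # -p: preserve file permissions
    # -u: skip files that are newer on the remote side, to avoid overwriting remote changes
    # -L: copy the files that symlinks point to, so links do not break when paths differ on the remote server
    # -t: preserve modification times
    # -v: verbose, show what is updated
    # -z: compress data during transfer, useful on slow links
    rsync -aruLptvz --delete project/ user@remoteServer:/backup/project
    rsync -aruLptvz --delete project user@remoteServer:/backup/
What rsync does is mirroring: it keeps the remote server consistent with the local files. If the local files are fine, the remote copy is fine too. But if files are deleted by mistake or damaged by a faulty program, and nobody notices before the next sync, the remote backup loses its purpose because it is damaged as well. Accidental deletion is fairly easy to spot and correct in time; damage caused by a misbehaving program is not necessarily so.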
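Both points above, the trailing-slash rule and the fact that a mirror faithfully repeats deletions, can be tried out locally without a remote server. The following is a minimal sketch that assumes rsync is installed; the directory names backup_a and backup_b and the two sample files are made up for the demo, and only the `-a` and `--delete` options are used.

```shell
#!/bin/bash
# Local demo of rsync's trailing-slash rule and of --delete mirroring.
set -euo pipefail
command -v rsync >/dev/null || { echo "rsync is not installed; skipping" >&2; exit 0; }

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/project" "$work/backup_a" "$work/backup_b"
echo "data" > "$work/project/result.txt"
echo "tmp"  > "$work/project/scratch.txt"

# Trailing slash: copy the contents of project/ into backup_a/project.
rsync -a --delete "$work/project/" "$work/backup_a/project"
# No trailing slash: rsync creates backup_b/project itself.
rsync -a --delete "$work/project" "$work/backup_b/"

# Both commands produce the same layout.
diff -r "$work/backup_a" "$work/backup_b" && echo "layouts match"

# A deletion at the source is mirrored by --delete on the next run,
# so the file is gone from the backup as well.
rm "$work/project/scratch.txt"
rsync -a --delete "$work/project/" "$work/backup_a/project"
ls "$work/backup_a/project"   # prints "result.txt"
```

backup_b is not synchronized a second time, so it still holds scratch.txt. The mirror in backup_a does not, which is exactly the weakness described above.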
rdiff-backup
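This is where rdiff-backup comes in. It not only makes incremental backups but also keeps the state of every backup, stored as the differences between each new backup and the previous one, so you can easily go back to an earlier version. The only requirement is that the local and remote servers have the same version of rdiff-backup installed. Two other tools, duplicity and rsnapshot, do similar work, but their approaches differ and so does the disk space they use; see the comparisons linked under References.
The installation and use of rdiff-backup are described below.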
Install rdiff-backup at both local and remote computers
    # Install dependencies on Ubuntu or Debian
    sudo apt-get install python-dev librsync-dev
    # Or compile librsync yourself
    # Download librsync from https://sourceforge.net/project/showfiles.php?group_id=56125
    tar xvzf librsync-0.9.7.tar.gz
    cd librsync-0.9.7
    export CFLAGS="$CFLAGS -fPIC"
    ./configure --prefix=/home/user/rsync --with-pic
    make
    make install
Install rdiff-backup
    # See the References part for the download link
    # http://www.nongnu.org/rdiff-backup/
    python setup.py install --prefix=/home/user/rdiff-backup
    # If you compiled librsync yourself, specify its location
    python setup.py --librsync-dir=/home/user/rsync install --prefix=/home/user/rdiff-backup
Add executable files and Python modules to environment variables
    # Add the following lines to .bashrc, .bash_profile, or any other shell config file
    export PATH=${PATH}:/home/user/rdiff-backup/bin
    export PYTHONPATH=${PYTHONPATH}:/home/user/rdiff-backup/lib/python2.x/site-packages
    # The x in python2.x above is 6 or 7, depending on the Python version used.
Test environment variables when executing commands through ssh
    ssh user@host 'echo ${PATH}'
    # When I ran this command from my local computer, only the system environment
    # variables were used and none of my self-defined ones.
    # So I modified the following lines in SetConnections.py under
    # /home/user/rdiff-backup/lib/python2.x/site-packages/rdiff_backup
    # to set the environment explicitly at login.
    # Pay attention to the single quotes used inside the double quotes.
    __cmd_schema = "ssh -C %s 'source ~/.bash_profile; rdiff-backup --server'"
    __cmd_schema_no_compress = "ssh %s 'source ~/.bash_profile; rdiff-backup --server'"
    # Source whichever of .bash_profile and .bashrc holds the environment variables for rdiff-backup.
Use rdiff-backup
Start backup
    rdiff-backup --no-compression --print-statistics user@host::/home/user/source_dir destination_dir
If destination_dir already exists, add --force, as in rdiff-backup --no-compression --force --print-statistics user@host::/home/user/source_dir destination_dir. Everything in the original destination_dir will be deleted.
To exclude or include particular files or directories, specify them like --exclude '**trash' or --include /home/user/source_dir/important.
Back up your data on a schedule
Add the above command to crontab (run crontab -e in a terminal to edit it) in a format like 5 22 */1 * * command, which executes the command at 22:05 every day.
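Since the crontab field order (minute first, then hour) is easy to get backwards, a small helper can build the entry from named arguments. This is only an illustrative sketch: `cron_line` is a hypothetical function and /home/user/bin/backup.sh is a placeholder path for a script wrapping the rdiff-backup command. The helper only prints the entry and does not touch the installed crontab.

```shell
#!/bin/bash
set -euo pipefail

# cron_line MINUTE HOUR COMMAND...: print a crontab entry that runs once a day.
# (Hypothetical helper for illustration.)
cron_line() {
    local minute=$1 hour=$2
    shift 2
    printf '%s %s * * * %s\n' "$minute" "$hour" "$*"
}

# /home/user/bin/backup.sh is a placeholder for your own backup script.
entry=$(cron_line 5 22 /home/user/bin/backup.sh)
echo "$entry"   # prints "5 22 * * * /home/user/bin/backup.sh"

# To install the entry while keeping existing ones, one could run:
#   (crontab -l 2>/dev/null; echo "$entry") | crontab -
```

The day-of-month field is written as * here, which for a daily job is equivalent to the */1 used above.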
Restore data
Restore the latest data by running rdiff-backup -r now destination_dir user@host::/home/user/source_dir.restore. Add --force if you want to restore to source_dir.
Restore files as they were 10 days ago by running rdiff-backup -r 10D destination_dir user@host::/home/user/source_dir.restore. Other acceptable time formats include 5m4s (5 minutes 4 seconds) and 2014-01-01 (January 1st, 2014).
Restore files from an increment file by running rdiff-backup destination_dir/rdiff-backup-data/increments/server_add.2014-02-21T09:22:45+08:00.missing user@host::/home/user/source_dir.restore/server_add. Increment files are stored under destination_dir/rdiff-backup-data/increments/, for example server_add.2014-02-21T09:22:45+08:00.missing.
Remove older records to save space
Delete all information about file versions that have not been current for 2 weeks by running rdiff-backup --remove-older-than 2W --force destination_dir. Note that an existing file which has not changed for a year will still be preserved, but a file which was deleted 15 days ago cannot be restored after this command. Normally one should use --force, since by default --remove-older-than refuses to delete multiple increments at the same time.
Keep only the last 20 rdiff-backup sessions by running rdiff-backup --remove-older-than 20B --force destination_dir.
Statistics
List the increments in a given folder with rdiff-backup --list-increments destination_dir/.
List the files changed in the last 5 days with rdiff-backup --list-changed-since 5D destination_dir/.
Compare the source and the backup with rdiff-backup --compare user@host::source-dir destination_dir.
Compare the source and the backup (as it was two weeks ago) with rdiff-backup --compare-at-time 2W user@host::source-dir destination_dir.
A complete script (automatically sync using crontab)
    #!/bin/bash
    export PYTHONPATH=${PYTHONPATH}:/soft/rdiff_backup/lib/python2.7/site-packages/
    # Back up, skipping anything matched by **trash
    rdiff-backup --no-compression -v5 --exclude '**trash' user@server::source/ bak_dir/
    ret=$?
    # Mail the outcome of the backup
    if test $ret -ne 0; then
        echo "Wrong in bak" | mutt -s "Wrong in bak" bak@mail.com
    else
        echo "Right in bak" | mutt -s "Right in bak" bak@mail.com
    fi
    echo "Finish rdiff-backup $0 ---`date`---" >>bak.log 2>&1
    # Mail the differences between the source and the backup as of one day ago
    echo "`rdiff-backup --exclude '**trash' --compare-at-time 1D user@server::source/ bak_dir/`" | mutt -s "Lists of baked files" bak@mail.com
References
rdiff-backup
duplicity
rsnapshot
http://www.saltycrane.com/blog/2008/02/backup-on-linux-rsnapshot-vs-rdiff/
http://james.lab6.com/2008/07/09/rdiff-backup-and-duplicity/
http://bitflop.com/document/75
http://askubuntu.com/questions/2596/comparison-of-backup-tools
http://www.reddit.com/r/linux/comments/fgmbb/rdiffbackup_duplicity_or_rsnapshot_which_is/
http://serverfault.com/questions/491341/optimize-space-rdiff-backup
Another great post on usage of rdiff-backup
Original article: https://mp.weixin.qq.com/s/Ovl46SbnQLc5q6Rz3Iaczg