Snowball Uploader

A script for efficiently moving billions of files to Snowball

Date: February 20, 2021
Written by: Yongki Kim

Changelog

  - 2022.01.19
    - added option to bypass setting the auto-extract metadata tag
  - 2021.02.20
    - save filelist_dir as filelist-currentdata.gz when executing genlist
  - 2021.02.20
    - performance improvement of genlist; dumping file list, not each line
  - 2021.02.20
    - replacing scandir.walk with os.walk; os.walk has been implemented with scandir since python3.5
  - 2021.02.10
    - replacing os.path with scandir.path to improve performance of file listing
  - 2021.02.09
    - python2 compatibility for "open(filename, encoding)"
  - 2021.02.01
    - modifying to support Windows
    - refactoring for more accurate defining of variables
  - 2021.01.26
    - multi processing support for parallel uploading of tar files
    - relevant parameter: max_process
  - 2021.01.25
    - removing yaml feature, because it caused too much cpu consumption and low performance
    - fixing a bug which used two profiles (sbe1, default); now only the "sbe1" profile is used
    - showing progress
  - 2020.02.25
    - changing filelist file to contain the target filename
  - 2020.02.24
    - fixing FIFO error
    - adding example of real snowball configuration
  - 2020.02.22
    - limiting multi-thread numbers
    - adding multi-threading to improve performance
    - adding fifo operation for big files which are over max_part_size
  - 2020.02.19
    - removing tarfiles_one_time logic
    - splitting buffer by max_part_size
  - 2020.02.18
    - support snowball limit:
      - max_part_size: 512mb
      - min_part_size: 5mb
  - 2020.02.14
    - modifying for python3
    - support Korean in Windows
  - 2020.02.12
    - adding features: gen_filelist by size
  - 2020.02.10
    - changing filename from tar_to_s3_v7_multipart.py to snowball_uploader_8.py
  - adding features which can split tar file by size and count.
  - adding feature which create file list
  - showing help message

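Introduction

Snowball_uploader was developed to move large numbers of files efficiently to Snowball or Snowball Edge, the AWS appliances for migrating petabyte-scale data to S3. When there are millions of small files, transferring them one by one takes too long, which delays the project and raises the cost of renting the Snowball. With snowball_uploader you can shorten the transfer time. It archives files into a part in memory, sends that part as one big chunk, and aggregates the chunks into several tar files.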

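Performance comparison of uploading files individually and uploading with the script

First, I will show you the performance results. The first Snowball result was measured while uploading each file individually and changing its name. The second result was measured with the script, which builds tar archives in memory and sends them to Snowball. In the table below, you will notice that the second option performed at least 7 times better.

First Snowball performance: uploading each file with aws s3 cp
Second Snowball performance: uploading files in big chunks with a draft version of snowball_uploader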

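Target	Number of files	Total capacity	NAS -> Snowball time	Snowball -> S3 time	Failed objects
First Snowball performance	19,567,430	2,408 GB	1 week	113 hours	954
Second Snowball performance	about 119,577,235	14,708 GB	1 week	26 hours	0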
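
To read the table in terms of throughput, the Snowball -> S3 stage can be compared with a few lines of arithmetic. This is only a rough sketch that reuses the figures from the table (the second file count is approximate); the names `runs` and `throughput` are illustrative and are not part of snowball_uploader.

```python
# Figures copied from the table above (Snowball -> S3 import stage).
runs = {
    "first": {"files": 19567430, "size_gb": 2408, "hours": 113},
    "second": {"files": 119577235, "size_gb": 14708, "hours": 26},
}


def throughput(run):
    """Return (GB per hour, files per hour) for one run."""
    return run["size_gb"] / run["hours"], run["files"] / run["hours"]


first_gb_h, first_files_h = throughput(runs["first"])
second_gb_h, second_files_h = throughput(runs["second"])

print("first : %.1f GB/h, %.0f files/h" % (first_gb_h, first_files_h))
print("second: %.1f GB/h, %.0f files/h" % (second_gb_h, second_files_h))
print("speed-up in GB/h: %.1fx" % (second_gb_h / first_gb_h))
```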

Snowball Edge manual

Snowball Edge data migration guide: https://d1.awsstatic.com/whitepapers/snowball-edge-data-migration-guide.pdf?did=wp_card&trk=wp_card

Usage

Prerequisites

Python 3.5
- Python 2 also works, but only with English file names
boto3
awscli

Execution

Change the parameters

import os
import boto3

bucket_name = "your-own-bucket"
session = boto3.Session(profile_name='sbe1')
s3 = session.client('s3', endpoint_url='http://10.10.10.10:8080')
# or below
#s3 = boto3.client('s3', endpoint_url='https://s3.ap-northeast-2.amazonaws.com')
#s3 = boto3.client('s3', region_name='ap-northeast-2', endpoint_url='https://s3.ap-northeast-2.amazonaws.com', aws_access_key_id=None, aws_secret_access_key=None)
target_path = '/move/to/s3/orgin/'   ## very important!! change to your source directory
max_tarfile_size = 10 * 1024 ** 3 # 10GB
max_part_size = 300 * 1024 ** 2 # 300MB
min_part_size = 5 * 1024 ** 2 # 5MB
max_process = 5  # concurrent processes; set this to less than the number of filelist files in filelist_dir
if os.name == 'nt':
    filelist_dir = "C:/Temp/fl_logdir_dkfjpoiwqjefkdjf/"  #for windows
else:
    filelist_dir = '/tmp/fl_logdir_dkfjpoiwqjefkdjf/'    #for linux

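These parameters are essential for the script to run the way you intend.

bucket_name: enter your bucket name
session = boto3.Session(profile_name='sbe1'): enter your AWS profile name
target_path: the directory path to transfer to Snowball
- If target_path = '/move/to/s3/orgin/', the files are moved to s3://'bucket_name'/move/to/s3/orgin/
- If target_path = '.', the files are moved to s3://'bucket_name'/
- So it is very important to set target_path correctly before running snowball_uploader
- I recommend testing the script with sample data before applying it to real data.
max_tarfile_size: the size of the tar file uploaded to Snowball
- This value should be under 100 GB
- snowball_uploader archives the files into a tar file on Snowball, and that tar file is extracted automatically.
- metadata = {"snowball-auto-extract": "true"}: this metadata is added to the tar file.
- Snowball limit reference: https://docs.aws.amazon.com/snowball/latest/developer-guide/batching-small-files.html
max_part_size: the maximum multipart part size; Snowball limits the maximum part size to 512 MB
- The script uses the S3 multipart upload feature to aggregate files into one large tar file
- Snowball limit reference: https://docs.aws.amazon.com/snowball/latest/ug/limits.html
min_part_size: the minimum part size; Snowball limits the minimum multipart part size to 5 MB
- Reference: https://docs.aws.amazon.com/snowball/latest/ug/limits.html
max_process: the number of concurrent processes; snowball_uploader uses multiple processes to improve upload speed
filelist_dir: the location where the filelist files are generated
- The directory /tmp/fl_logdir_dkfjpoiwqjefkdjf/ is fixed; whenever you run the script with the genlist parameter, this directory is deleted and recreated.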
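
Before a long run, it may help to check the size parameters against the limits listed above. The following is a minimal sketch, not part of snowball_uploader itself: `check_limits` and the `SNOWBALL_*` constants are hypothetical names, and the limit values are simply the ones quoted in this section.

```python
GB = 1024 ** 3
MB = 1024 ** 2

# Limits quoted in the parameter notes above.
SNOWBALL_MAX_TARFILE_SIZE = 100 * GB
SNOWBALL_MAX_PART_SIZE = 512 * MB
SNOWBALL_MIN_PART_SIZE = 5 * MB


def check_limits(max_tarfile_size, max_part_size, min_part_size):
    """Return a list of problems; an empty list means the values fit the limits."""
    problems = []
    if max_tarfile_size >= SNOWBALL_MAX_TARFILE_SIZE:
        problems.append("max_tarfile_size should be under 100 GB")
    if max_part_size > SNOWBALL_MAX_PART_SIZE:
        problems.append("max_part_size should not exceed 512 MB")
    if min_part_size < SNOWBALL_MIN_PART_SIZE:
        problems.append("min_part_size should be at least 5 MB")
    if min_part_size > max_part_size:
        problems.append("min_part_size should not exceed max_part_size")
    return problems


# The example values from the parameter block (10 GB, 300 MB, 5 MB).
print(check_limits(10 * GB, 300 * MB, 5 * MB))
```

Calling it once at startup and stopping when the list is not empty would catch a mistyped value before any data is sent.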

Genlist

ec2-user > python3 snowball_uploader.py genlist

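The genlist parameter generates manifest files that contain the original file names and the target file names. It should be run before copying the files.

Each filelist groups the files to transfer so that the sum of their sizes fits the tar file size; the maximum tar file size should be under 100 GB.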
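
The grouping idea can be sketched as follows. This is an illustration under assumptions, not the script's actual genlist code: `list_files` and `split_by_size` are hypothetical helpers, and the sample sizes are made up to keep the example small.

```python
import os


def list_files(target_path):
    """Yield (path, size_in_bytes) for every regular file under target_path."""
    for root, _dirs, names in os.walk(target_path):
        for name in names:
            path = os.path.join(root, name)
            if os.path.isfile(path):
                yield path, os.path.getsize(path)


def split_by_size(files, max_tarfile_size):
    """Group (path, size) pairs so each group's total stays within max_tarfile_size.

    A single file larger than max_tarfile_size still gets a group of its own.
    """
    groups, current, current_size = [], [], 0
    for path, size in files:
        if current and current_size + size > max_tarfile_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical sizes, with a 100-byte limit instead of gigabytes.
sample = [("a", 60), ("b", 30), ("c", 50), ("d", 120), ("e", 10)]
print(split_by_size(sample, 100))
```

With real data, `split_by_size(list_files(target_path), max_tarfile_size)` would give one group per filelist file.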

ec2-user > ls /tmp/fl_logdir_dkfjpoiwqjefkdjf
fl_1.yml fl_2.yml fl_3.yml fl_4.yml fl_5.yml

Contents of a filelist

ec2-user > cat fl_1.yml
- ./snowball_uploader_11_failed.py: ./snowball_uploader_11_failed.py
- ./success_fl_2.yaml_20200226_002049.log: ./success_fl_2.yaml_20200226_002049.log
- ./file_list.txt: ./file_list.txt
- ./snowball-fl_1-20200218_151840.tar: ./snowball-fl_1-20200218_151840.tar
- ./bytesio_test.py: ./bytesio_test.py
- ./filelist_dir1_10000.txt: ./filelist_dir1_10000.txt
- ./snowball_uploader_14_success.py: ./snowball_uploader_14_success.py
- ./error_fl_1.txt_20200225_022018.log: ./error_fl_1.txt_20200225_022018.log
- ./snowball_uploader_debug_success.py: ./snowball_uploader_debug_success.py
- ./success_fl_1.txt_20200225_022018.log: ./success_fl_1.txt_20200225_022018.log
- ./snowball_uploader_20_thread.py: ./snowball_uploader_20_thread.py
- ./success_fl_1.yml_20200229_173222.log: ./success_fl_1.yml_20200229_173222.log
- ./snowball_uploader_14_ing.py: ./snowball_uploader_14_ing.py

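The manifest file is written in YAML format
The key on the left is the original file name
The value on the right is the target file name; if you want to change the file name on S3, you can change it with the rename_file method.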

def rename_file(org_file):
    target_file = org_file  ##
    return target_file
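
As an illustration of how rename_file could be customized, the sketch below places every object under a prefix. The `prefix` argument, the "backup/" value, and the `manifest_line` helper are assumptions made for this example; they are not part of the original script.

```python
def rename_file(org_file, prefix="backup/"):
    """Return the target key for org_file (hypothetical variant with a prefix)."""
    # Use forward slashes and drop a leading "./" so the key looks the same
    # whether the script runs on Linux or Windows.
    cleaned = org_file.replace("\\", "/")
    if cleaned.startswith("./"):
        cleaned = cleaned[2:]
    return prefix + cleaned


def manifest_line(org_file):
    """Format one entry in the '- original: target' style shown above."""
    return "- %s: %s" % (org_file, rename_file(org_file))


print(manifest_line("./file_list.txt"))
```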

cp_snowball

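The cp_snowball parameter transfers the files to Snowball.
While the script runs, it creates two log files, success_'file_name'_'timestamp'.log and error_'file_name'_'timestamp'.log
- success_'file_name'_'timestamp'.log: contains the names of the files that were successfully archived into the tar file
- error_'file_name'_'timestamp'.log: contains the names of the files that are written in the filelist but do not exist in the file system.
- With these logs, you can check which files were transferred and which were not.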

How it works

    #print('\n')
    print('genlist: ')
    print('this option will generate files which are containing target files list in %s' % (filelist_dir))
    #print('\n')
    print('cp_snowball: ')
    print('cp_snowball option will copy the files on server to snowball efficiently')
    print('the mechanism is here:')
    print('1. reads the target file name from the one filelist file in filelist directory')
    print('2. accumulates files to max_part_size in memory')
    print('3. if it reaches max_part_size, send it to snowball using MultiPartUpload')
    print('4. during sending data chunk, threads are invoked to max_thread')
    print('5. after complete to send, tar file is generated in snowball')
    print('6. then, moves to the next filelist file recursively')
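
Steps 2, 3, and 5 of the mechanism can be sketched as below. This is a simplified illustration under assumptions, not the script's real implementation: it archives in-memory bytes instead of reading files, leaves out the threads and processes, and the helper names are hypothetical. `upload_parts` expects a boto3-style S3 client such as the `s3` object from the parameter block. The 4096-byte part size is only for the demonstration, since S3 multipart uploads require every part except the last to be at least 5 MB.

```python
import io
import tarfile


def build_tar_in_memory(files):
    """Archive (name, content_bytes) pairs into a tar held entirely in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, content in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def split_parts(data, max_part_size):
    """Cut the archive bytes into consecutive parts of at most max_part_size."""
    return [data[i:i + max_part_size] for i in range(0, len(data), max_part_size)]


def upload_parts(s3, bucket, key, parts):
    """Send the parts with the multipart upload calls of a boto3-style client."""
    upload = s3.create_multipart_upload(
        Bucket=bucket, Key=key,
        Metadata={"snowball-auto-extract": "true"},  # auto-extract tag
    )
    done = []
    for number, body in enumerate(parts, start=1):
        reply = s3.upload_part(Bucket=bucket, Key=key, PartNumber=number,
                               UploadId=upload["UploadId"], Body=body)
        done.append({"ETag": reply["ETag"], "PartNumber": number})
    s3.complete_multipart_upload(Bucket=bucket, Key=key,
                                 UploadId=upload["UploadId"],
                                 MultipartUpload={"Parts": done})
    return done


archive = build_tar_in_memory([("a.txt", b"hello"), ("b.txt", b"world")])
parts = split_parts(archive, 4096)
print(len(archive), [len(p) for p in parts])
```

Joining the parts in order gives back a valid tar archive, which is what makes it possible to send the archive piece by piece and have the tar file assembled on the Snowball side.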

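Conclusion

I am not a professional programmer, so the script may have some flaws, and its error handling is very poor. If you set the parameters (max_threads, max_part_size, and max_tarfile_size) too high, the script may consume a lot of memory and freeze the system. So please test it several times with sample data. When I used it at a customer site, it cut the elapsed time by a factor of 10. I hope this script helps you too.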
