I recently worked with a client that had about 200GB of data, mostly smaller files like images and PDFs, stored on a web server. I used BitTorrent Sync to get all of that data from the server to a computer with very little effort. When I began working with them, they had no backups of the data at all – their webhost did daily backups, but those were essentially inaccessible to us, beyond the ability to put in a request for a backup to be restored. My initial plan was simple: purchase a small, low-power computer with a 4TB hard drive, far more than enough for the 200GB of data and for future growth. I would then get the data from the server to the computer and deliver the computer to the client’s office. They would then have a backup of all of their data on-site.
The question is, what software solution would be able to sync that amount of data, quickly and easily?
- The Requirements
- The Options
- The Decision – BitTorrent Sync
The Requirements

I had several requirements for this project, each of which was non-negotiable. It would later turn out that BitTorrent Sync would solve each of them quite well, although it did have a couple of small drawbacks that needed to be mitigated.
- Free (or cheap) – the solution would need to be extremely low-cost. It seems like this should be possible – the entirety of the internet is just bits flying around, so why wouldn’t I be able to move a huge amount of data from one computer to another for free?
- Fast – I long for the future where 200GB of data is laughed at as ridiculously small. In the meantime, a slow connection could make a transfer of this size take weeks.
- Real-time – Ideally, every time a file gets changed on the server, it should get changed on the backup computer as well. This would prevent me from having to write a backup script to get changed files on a regular basis.
- Secure – Although my client doesn’t have to comply with strict data security regulations like HIPAA, I still don’t want my client’s data available on the larger internet. It belongs to my client, and to no one else.
The Options

Dropbox

I came up with a number of ideas, none of which was particularly good, but any of which might have worked. My first idea was to use Dropbox – an easy-to-use, free service with which I was very familiar. The plan was to use a free Dropbox account, which gets me 2GB of space, and use it to transfer only the new files nightly. The client uses a custom-written application on their server, and luckily for me the program is still being developed. I was able to get the developer to copy all new files not only to the existing data directory but also to a new directory created for me – I would tie that directory to Dropbox, so that all future data would be synced between the server’s Dropbox folder and the computer’s Dropbox folder. I would then have a script running every few minutes on the computer that would pull the data out of that folder and put it elsewhere – keeping the Dropbox folder as empty as possible for as much of the time as possible.
This probably would have worked – the initial sync would still be a problem, but I had a solution for that too: just dump all the files into Dropbox anyway. The free account only has 2GB of space, but Dropbox doesn’t just die when you hit that point – it syncs 2GB and no more. I could therefore have dumped all the data in there and let 2GB of it sync, then had the desktop pull data out every minute, allowing more of it to sync. It’s certainly not an elegant solution, but once the initial transfer was done, I could continue the process with only the changed files, which would add up to under 2GB daily. There’s also a potential concern about how Dropbox would behave with a single file larger than 2GB – it wasn’t a concern with my data, which was a large number of tiny files, but whether Dropbox would consistently allow large files to sync would definitely be worth researching.
This plan was free, in that I would not be paying for a Dropbox for Business account. It certainly wouldn’t be fast, because it could only sync in 2GB chunks – although I could potentially move the data in and out relatively quickly (which might have gotten the account flagged for suspicious activity). It would be relatively real-time – files would be copied to the Dropbox directory as soon as they were uploaded to the server, and from there would begin syncing with the computer immediately. And it would be secure – Dropbox is a reasonably secure platform.
FileZilla – FTP
Another idea would be to use FTP, or SFTP. We already had a FileZilla FTP server up and running on the server for ease of file transfer. I could have allowed the entire data directory to be available via FTP – and then have a script running on the desktop that checks for new files, or keeps an index of files and checks for changes periodically.
This solution would also be free, given that both the FileZilla client and server are free. However, the programming aspect of polling an FTP server really turned me off from the whole idea. In terms of speed, the FTP server/client combination can pretty easily use up all available bandwidth, which is nice – but the true speed would be based on the efficiency of the polling script. It would indeed be real-time, in that the directory itself would be accessible via the FTP server. In terms of security, it can be fairly easily set up with SSL/TLS for security, which is a big plus.
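For a sense of what that polling script would involve, here is a minimal Python sketch: it indexes the remote directory as {name: modification stamp} and diffs against the previous index to find the files worth downloading. The server details are hypothetical, and it assumes the FTP server supports the MLSD command:

```python
from ftplib import FTP

def build_index(ftp, directory):
    """Index a remote directory as {name: modify-stamp} via MLSD."""
    return {name: facts.get("modify")
            for name, facts in ftp.mlsd(directory)
            if facts.get("type") == "file"}

def changed_files(old_index, new_index):
    """Names that are new or whose modification stamp changed since
    the last poll -- the only files worth downloading this round."""
    return [name for name, stamp in new_index.items()
            if old_index.get(name) != stamp]

# Hypothetical usage, run periodically (e.g. from a scheduled task):
# ftp = FTP("ftp.example.com")
# ftp.login("backup", "secret")
# new_index = build_index(ftp, "/data")
# for name in changed_files(previous_index, new_index):
#     ...download name via ftp.retrbinary()...
# then persist new_index as previous_index for the next run
```

Even in sketch form you can see the problem: the script has to persist its index between runs, handle partial downloads, and deal with files that change mid-poll – exactly the programming burden that put me off the idea.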
CrashPlan

CrashPlan is another product with which I’m familiar – you can see my review of the Family plan here. It’s a fairly robust program, and the free option lets you use it to sync two computers, which is nice. If you pay, you get an additional sync destination: their cloud backup service. However, as I found in my review, the program itself simply isn’t performant – as the number of files increases, so does the RAM requirement. That means that although the program is free, it would be eating up precious resources on the server and never giving them back. As long as CrashPlan is running, it requires a ton of RAM. So it’s free, but not really.
Speed is also a major issue for CrashPlan – as I mention throughout my review, CrashPlan can be extremely slow, with little to no explanation. The only support they offer is basically to restart your router (seriously). In a home environment, that’s incredibly unhelpful – and in a business environment, impossible. I wasn’t going to interrupt my client’s internet access because the solution I chose got slow. The solution would be real-time, in that the folders linked by CrashPlan would be synchronized, with changes propagated from the server to the PC. Unfortunately, given the slow speed of propagation, “real-time” is relative.
In terms of security, CrashPlan is pretty secure. Their free version uses the same Blowfish encryption as their paid version, but the free version uses a key strength of 128-bit while their paid version uses 448-bit. That’s kind of annoying, but they’re a business so I’m lucky they’re giving any part of their program away for free, I guess. Still, it’s another reason not to go with them as an option.
The Decision – BitTorrent Sync
In the end, I went with BitTorrent Sync. This option has everything I need – in terms of price, it’s completely free. In terms of speed, BitTorrent Sync is able to eat up 100% of my available upload bandwidth on my server, or 100% of my download bandwidth on my desktop, whichever is the bottleneck. It also gives me the ability to limit the inbound or outbound bandwidth so it’s not causing problems for other services. In terms of real-time connectivity, so long as both the server and the desktop are running the BitTorrent Sync service, pretty much as soon as I put a file in on the server I see it begin to sync on the desktop. Works perfectly.
Security is an interesting aspect. In order to connect the server and the desktop, you first set up a share on the server, and then you get a private key for that share. You can then insert that private key on the desktop, and it will begin syncing. According to NetworkWorld.com’s analysis of the Hackito article on BitTorrent Sync security, it shouldn’t be used for sensitive data for a number of reasons. Now, on the one hand, I’m not using it for inherently sensitive data. On the other hand, I have mitigation protocols in place for any of the issues mentioned at the end of the NetworkWorld article.
- “GetSync.com server receives many (all?) hashes in clear-text when sharing the directory; it is used to share links amongst people, even though the previous BTsync hash sharing mechanism was better for security.”
- This is a fair criticism – the hashes that are used to share a directory are sent in clear-text, and therefore could potentially be compromised and stolen. However, I still have the bandwidth monitoring in place – if somehow someone managed to get that key, they wouldn’t get very much data from the server before it could be shut down.
- “There was a change of Sync’s sharing paradigm after the first releases that introduced a vulnerability, which may be the result of NSL (National Security Letters, from US Government to businesses to pressure them in giving out the keys or introducing vulnerabilities to compromise previously secure systems) that could have been received by BitTorrent Inc and/or developers.”
- Another fair criticism – the government can potentially demand that BitTorrent hand over my key. Again, the government still wouldn’t get very much data from my server before I could turn the BitTorrent Sync service off, or before it could be turned off automatically. It would actually be a smarter decision by the government to just demand that my webhost hand over the login information to my server in the first place, which I can’t really protect against (and theoretically, no one can).
- “Leak about the private network addresses of clients that gives indication about where and what to attack.”
- If the hackers don’t know anything about your private network’s infrastructure, it would be much harder to hack – that being said, there’s nothing particularly interesting in my client’s private network that would be of any use to hackers. If they really wanted to have some fun, they’d go through the webhost’s private network, and me running BitTorrent Sync doesn’t affect that very much.
- There are “probable multiple vulnerabilities in the clients.”
- You can read the entire Hackito article for more information about these “probable” vulnerabilities, but basically, they just don’t scare me. Would I put a massive number of credit card numbers and social security numbers in BitTorrent Sync? Probably not. But a bunch of images and PDFs? If some hackers want to read them all, they’ll be very, very bored.
The choice of BitTorrent Sync as the synchronization solution comes with a number of bonus features that are pretty cool. The first is that the synchronization is based on torrent technology – so far, my whole use case has been about syncing one server and one desktop. If I were to get a second desktop, however, it would join the pool of syncing computers. Both the server and the existing desktop could act as a “seed”, providing data to the new desktop. Because of this, the files would be downloading from two sources instead of just one, theoretically doubling the download speed. More realistically, if there was an upload bottleneck on the side of either the existing desktop or the server, that bottlenecked speed would be supplemented by the additional speed of the second seed.
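The bottleneck reasoning above is easy to make concrete: the best-case sync rate for a new peer is the sum of every seed’s upload capacity, capped by the new peer’s own download link. A toy calculation, with made-up speeds in Mbps just for illustration:

```python
def effective_rate(client_download_mbps, seed_upload_mbps):
    """Best-case transfer rate for a new peer: the combined upload
    capacity of all seeds, capped by the downloader's own link."""
    return min(client_download_mbps, sum(seed_upload_mbps))

# One seed (the server) uploading at 20 Mbps to a 100 Mbps desktop:
one_seed = effective_rate(100, [20])       # the server's upload binds
# Add the existing desktop as a second 20 Mbps seed:
two_seeds = effective_rate(100, [20, 20])  # throughput doubles...
# ...until the new machine's own link becomes the bottleneck:
capped = effective_rate(30, [20, 20])
```

This is why the second seed “theoretically doubles” the speed: it doubles the pool’s upload capacity, and the gain holds until some other link becomes the constraint.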
Another great feature is the Archive. One potential downside of a truly real-time synchronization option is that if, for some reason, all of the files get deleted from the server, those deletions would be propagated to the desktop, which would synchronize by deleting all of its files as well. The good news is that BitTorrent Sync has a section called the Archive, which holds files even after they get deleted. You can customize how long files are kept and how large the Archive can grow – I have those settings high enough that the entire 200GB of data can sit in the Archive for up to 30 days, if need be. Since I would most likely be getting panicked phone calls the same day everything got deleted, 30 days is overkill – but it’s nice to know it’ll all be there.
I mentioned that when setting up the relationship between the two BitTorrent Sync clients, I needed to share a key. The cool thing is, there are actually two different kinds of keys: a standard key and a read-only key. A standard key keeps both folders synchronized, meaning that a change made on either the server or the desktop is sent to the other side. But I don’t want changes on the desktop to be sent to the server – if for some reason the desktop gets messed with, the server needs to continue to work properly. By providing the desktop with a read-only key, it is able to “read” the changes made on the server and stay synchronized, but any changes made on the desktop will not be “written” to the server. I can rest assured that even though I’m putting the desktop in my client’s office, even if someone really messed with it, the server would not be affected at all.
BitTorrent Sync does have a couple of drawbacks, though. Although I have the client software set up on both the server and the desktop, the actual syncing server is proprietary. Without delving into too much detail about how torrents work, BitTorrent owns the tracker server, and I can’t run my own. Basically, the tracker server keeps track of which computers have which data, and lets them know about each other. So, when my PC uses the private key provided by my server, the tracker server pairs them up so they can begin syncing.
The issue is one of security – how do I really know that BitTorrent isn’t giving away that private key elsewhere? How do I know they don’t have their own backdoor in the program, giving them access to all of my files? There are a few reasons I feel safe despite those questions. The first is that the software itself lets me restrict the IP addresses that are allowed to get my files, meaning that even if someone else had my private key, they wouldn’t have the correct IP address. The desktop will be at my client’s office, which has a static IP, so I simply set that address as the “predefined host” and I’m good to go.
Of course, for this to be truly secure, I would have to trust the BitTorrent Sync software when it says it’s using only the predefined hosts. I could monitor outbound connections to determine whether any services running on the server are connecting to unknown IP addresses – however, because my backup is sizeable, I have an even easier option: bandwidth monitoring. I disabled it during the initial sync of the 200GB, but re-enabled it afterward. Now, even if another computer manages to find my secret key and spoof my client’s IP address, I’ll be notified by the webhosting company if my server starts sending out massive amounts of data. Even if the program itself is untrustworthy (and admittedly, mentioning torrent technology in a production environment may get you laughed at), it still wouldn’t be too much of a problem – perhaps someone could get at a little bit of the data before I got the notification, but they most certainly couldn’t get all 200GB before I could react (and in the future, I will be scripting this reaction).
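That future reaction script could be as simple as polling a bytes-sent counter and stopping the sync service when outbound throughput spikes. A rough sketch in Python – the 5 MB/s threshold, the psutil counter, and the service name are all assumptions to be adapted to the actual server:

```python
import subprocess
import time

LIMIT_BYTES_PER_SEC = 5 * 1024 * 1024  # assumed alert threshold; tune for the link

def exceeded(prev_sent, cur_sent, interval_s, limit=LIMIT_BYTES_PER_SEC):
    """True if average outbound throughput over the interval beat the limit."""
    return (cur_sent - prev_sent) / interval_s > limit

def watch(read_bytes_sent, interval_s=60):
    """Poll a cumulative bytes-sent counter and kill the sync service on
    a spike. read_bytes_sent is any callable returning that counter,
    e.g. lambda: psutil.net_io_counters().bytes_sent."""
    prev = read_bytes_sent()
    while True:
        time.sleep(interval_s)
        cur = read_bytes_sent()
        if exceeded(prev, cur, interval_s):
            # Service name is hypothetical -- stop whatever runs BTSync here.
            subprocess.run(["systemctl", "stop", "btsync"], check=False)
            break
        prev = cur
```

The point isn’t the exact numbers; it’s that the reaction becomes automatic instead of waiting on a phone call from the webhost.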