Given our project's constraints, this turned out to be a non-trivial problem, so I'm writing this blog post to go over what not to do and how I solved it.
The problem
We need to convert Photoshop files stored in S3 to JPGs. These files are big: up to 75GB each.
The constraints
We don't want to use an always-on EC2 instance with a lot of RAM; given how few assets we get today, that would be a waste of money. We ideally want to keep the process serverless. The biggest issue with using Lambdas is that we can't load these files into memory. Lambdas can only have up to 10GB of RAM.
Given that we can't load these files into memory, we need to read as little of the PSD file as possible to get the exported image. I suspected that the Photoshop format would allow this by having the data we needed at the beginning of the file, and I was right. To do this on the command line, you can run ImageMagick (Magick for short) like so:
magick 'image.psd[0]' image.png
So why does this work? Internally, Magick reads a Photoshop file as a sequence of images. The index 0 in the earlier command tells Magick to convert only the first image in that sequence, which works because a PSD's first image is a flattened composite of all the layers. To double-check, I profiled this command locally and confirmed that it reads very little of the file. Doing the same thing using Wand in Python looks like this:
from wand.image import Image

with Image(filename="image.psd[0]") as psd:
    with psd as first_layer:
        first_layer.save(filename="image.jpeg")
With this proof of concept out of the way, all I had to do was make this work in a Lambda, right?
What didn't work
The solution is as important as the steps it took to get to it, and I don't want another soul to go down all the dead ends I did before arriving at the same place. So below are all the things that didn't work:
You can't stream the file from S3:
So far, I've been able to read the top part of the file from disk; it's fair to assume you could do the same by streaming it from S3. I tried so many things, but none of them worked. boto3 does support streaming, but for some reason, Wand refuses to use the stream when it's given a file object or a blob. Wand's Image constructor doesn't support indexes, so after some searching online, you'll find people saying this should work:
from wand.image import Image

with Image(file=s3_file, format="PSD[0]") as psd:
    with psd as first_layer:
        first_layer.save(filename="image.jpeg")
Sadly, it doesn't, though it may have in an older version of Wand. The explicit approach in the code below doesn't work either: Wand loads the entire file whenever the sequence array is accessed.
from wand.image import Image

with Image(file=s3_file) as psd:
    with psd.sequence[0].clone() as first_layer:
        with first_layer.convert("jpeg") as converted:
            converted.save(filename="image.jpeg")
Mounting S3 as a file system doesn't work:
This idea might sound strange at first, but utilities like s3fs and s3fs-fuse exist. s3fs-fuse is out of the question since I can't run command-line installations in a Lambda, let alone mount a storage volume. As for s3fs, a Python library, it fails because the only way to access files is through its own S3File object, and Magick doesn't like this object for the same reason it didn't like boto3's file object.
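For completeness, the s3fs attempt looked roughly like this (the bucket and key are placeholders); it fails the same way the boto3 attempts above did:

import s3fs
from wand.image import Image

fs = s3fs.S3FileSystem()

# fs.open returns s3fs's own S3File object, which is the only way
# to access files through this library
with fs.open("my-bucket/image.psd", "rb") as s3_file:
    with Image(file=s3_file) as psd:  # chokes on the S3File, just like boto3's object
        psd.save(filename="image.jpeg")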
Copying S3 files into storage mounted to the Lambda
Given the above, I now have no choice but to copy the S3 files into some storage mounted to the Lambda. I can't use Lambda's ephemeral storage because it has its own 10GB limit. So, Elastic File System (EFS) it is.
I won't go into much detail about how to set up EFS and mount it to a Lambda; there are plenty of good tutorials about it out there. I will say that using EFS for this is great because it means different Lambda invocations can access these files without copying them from scratch.
Searching for how to copy S3 files into EFS will point you to AWS's DataSync. While DataSync works as advertised, it's overkill here. DataSync is meant for much bigger data-moving jobs, and the minimum schedule it allows is once per hour, which is way too slow a turnaround for processing the uploaded files.
So, we did this by triggering a Lambda that copies the file into EFS when it gets uploaded to S3. There is a lot of speculation online about how fast a Lambda can download from S3. From my testing, giving the Lambda enough RAM for the download buffer to breathe, plus having everything in a VPC with an S3 endpoint, makes it very fast: a 2GB file downloads in 0.6s. Even at a very conservative estimate of 250MB/s, a 75GB file would take 5 minutes, which is fine for our use case.
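A minimal sketch of that trigger handler, assuming the EFS volume is mounted at /mnt/efs (the mount path is a placeholder for whatever you configure):

import os
import urllib.parse

import boto3

s3 = boto3.client("s3")
EFS_DIR = "/mnt/efs"  # placeholder: wherever your EFS access point is mounted

def handler(event, context):
    # A single S3 put event can carry multiple records
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # download_file streams to disk in chunks, so a 75GB file
        # never has to fit in the Lambda's RAM
        s3.download_file(bucket, key, os.path.join(EFS_DIR, os.path.basename(key)))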
Running Wand inside a Lambda
Quoting from their page, "Wand is a ctypes-based simple ImageMagick binding for Python." This means that Wand requires ImageMagick to be installed on the machine it runs on. The next logical question is how we can install programs on a Lambda. We can't; Lambda environments run in Amazon Linux 2 containers without installation permissions.
However, we have a permanent EFS volume that our Lambda has access to. This means, in theory, we can keep a portable Magick installation in the EFS volume.
Magick's download page has a portable AppImage, but it doesn't work with Wand: Wand on Linux needs access to Magick's .so files. Someone with good knowledge of AppImages would say I should extract the AppImage to get these files. I didn't know what AppImages were before doing this, so I went through many rabbit holes that involved building the library myself with specific make and configure flags. There is no need to do so; all you need to do is run the command below on an EC2 instance running Amazon Linux:
ImageMagick-7.1.1-28.AppImage --appimage-extract
This command will create a squashfs-root directory with most of the files the portable install needs.
I said most because some .so files Magick needs are not on Amazon Linux. Rather than hardcode a list into this article, I'll say: Google whatever .so file your Amazon Linux tells you it's missing and install it.
Once the portable install works, run the following command to list all the dynamic dependencies the Magick executable is referencing.
ldd squashfs-root/usr/bin/magick
The output should look something like this:
...
libxml2.so.2 => /squashfs-root/usr/bin/./../lib/libxml2.so.2 (0x00007f06c32a4000)
libz.so.1 => /lib64/libz.so.1 (0x00007f06c328a000)
...
Find all the .so files that are not in the squashfs-root directory and copy them into a directory of their own, because we need to put them in the Wand Lambda Layer; a sketch of that collection step follows.
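If you'd rather not pick these out by hand, the collection can be scripted. A rough sketch (the so_files destination directory is my own placeholder):

import os
import shutil
import subprocess

os.makedirs("so_files", exist_ok=True)

# Run ldd and copy every resolved library that lives outside
# squashfs-root, i.e. the system-provided .so files
ldd = subprocess.run(
    ["ldd", "squashfs-root/usr/bin/magick"],
    capture_output=True, text=True, check=True,
)
for line in ldd.stdout.splitlines():
    if "=>" not in line:
        continue
    path = line.split("=>")[1].split()[0]
    if path.startswith("/") and "squashfs-root" not in path:
        shutil.copy(path, "so_files/")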
For these .so files to get referenced correctly, they need to exist in one of the paths listed under the LD_LIBRARY_PATH environment variable.
As of the time of writing this, the LD_LIBRARY_PATH variable in a Python Lambda looks like this:
/var/lang/lib:/lib64:/usr/lib64:/var/runtime:/var/runtime/lib:/var/task:/var/task/lib:/opt/lib
Lucky for us, it has /opt/lib at the end.
Python Lambda Layers get extracted under the /opt directory, which means we can get files into /opt/lib by putting them in a lib directory at the root of our Lambda Layer, as in the sketch below.
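A tiny sketch of that packaging step, reusing the hypothetical so_files directory from earlier:

import os
import zipfile

# Anything under lib/ in the layer zip is extracted to /opt/lib,
# which is already on the Lambda's LD_LIBRARY_PATH
with zipfile.ZipFile("wand-layer.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name in os.listdir("so_files"):
        zf.write(os.path.join("so_files", name), arcname=f"lib/{name}")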
Lastly, set the MAGICK_HOME environment variable to wherever your extracted AppImage lives, given the EFS mount point. Mine looked like this:
MAGICK_HOME=/mnt/efs/squashfs-root/usr
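Putting it all together, the conversion Lambda ends up looking roughly like this; the PSD and output paths are placeholders for whatever the copy step wrote to EFS:

import os

# Wand locates Magick's libraries via MAGICK_HOME at import time. In a
# real Lambda this is set in the function's configuration; it's set
# here in code only to make the sketch self-contained.
os.environ.setdefault("MAGICK_HOME", "/mnt/efs/squashfs-root/usr")

from wand.image import Image

def handler(event, context):
    # [0] selects the merged composite, so only the top of the PSD is read
    with Image(filename="/mnt/efs/image.psd[0]") as first_layer:
        first_layer.save(filename="/mnt/efs/image.jpeg")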
Other ways
I know the entire portable installation could live in a Lambda Layer; it fits within the 250MB limit. I went with the EFS route because it worked first, and I don't need to run this extraction process often. I've also seen people automate the creation of such Layers using a Docker container running Amazon Linux.
Final Remarks
This was a long process, but I learned a lot. I wish I had known what AppImages were; it would've saved me a few days.
Sorry for the inconsistent code highlighting; it seems that Blogstatic, where this blog is hosted, is using highlight.js's automatic language detection, which is inconsistent for some reason.