To follow up on my last article about Linux on the ASUS T100TA, I recently acquired (for about $150) an
ASUS C201 Chromebook, with a quad-core (1.8GHz?) ARM processor, 4GB RAM, and a tiny 16GB SSD. This is the first time I've used a Chromebook, and ChromeOS feels not-so-bad. I wish we could target it directly!
...but we can't! At least, not without going through JavaScript/WebAssembly/whatever. Having said that, you can put it in developer mode (which isn't difficult, but is sort of a pain in the ass: every time it boots, it prompts you to switch out of developer mode, and if you do, it wipes out all of your data, ugh). In developer mode, you can use
Crouton to install Linux distributions in a chroot environment (ChromeOS uses a version of the Linux kernel, but then has its own special userland environment that is no fun).
I installed Ubuntu 16.04 (xenial) on my C201, and it is working fine for the most part! It's really too bad there's no easy way to install Ubuntu completely natively, rather than having to run it alongside ChromeOS. ChromeOS has great support for the hardware (including sleeping), whereas when you're in the Ubuntu view, the machine doesn't seem to be able to sleep. So you have to remember to switch back to ChromeOS before closing the lid.
So I built REAPER on this thing, fun! And I still have a few GB of disk left, amazingly. Found a few bugs in EEL2/ARM when building with gcc5, fixed those (I'm now aware of __attribute__((naked)), and __clear_cache()).
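For readers who haven't run into __clear_cache(): on ARM the instruction and data caches are not kept coherent automatically, so code written to memory at runtime has to be flushed before it is executed. The following is only a minimal sketch of that pattern, not EEL2's actual code; install_code() is a hypothetical helper, and it assumes GCC or clang, which provide the __builtin___clear_cache() builtin.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy freshly generated machine code into dst and
   flush the instruction cache for that range so the CPU does not execute
   stale bytes. dst would normally point into executable memory. */
void install_code(void *dst, const void *code, size_t len)
{
    memcpy(dst, code, len);
    /* GCC/clang builtin; takes the start and end (exclusive) of the range. */
    __builtin___clear_cache((char *)dst, (char *)dst + len);
}
```

On x86 the builtin is effectively a no-op, which is one reason a missing flush can go unnoticed until the code is built for ARM.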
Some interesting performance comparisons, compiling REAPER:
- C201 (gcc 5.4): 9m 7s
- T100TA (gcc 6.3): 8m 45s
- Raspberry Pi 3 w/ slow MicroSD (gcc 4.7): 28m
REAPER v5.50rc6 (48kHz, 256 spls, stock settings), "BradSucks_MakingMeNervous.rpp" from old REAPER installers -- OGG Vorbis audio at low samplerates, a few FX here and there, not a whole lot else:
- C201: 28% CPU, 13% RT CPU, 15% FX CPU, longest block: 1.5ms
- T100TA: 22% CPU, 9% RT CPU, 10% FX CPU, longest block: 0.9ms
(The T100TA's ALSA drivers are rough: can't do samplerates other than 48kHz, can't do full duplex...)
Overall both of these cheapo laptops are really quite nice, reasonably usable for things, nice screens, outstanding battery life. If only the C201 could run Linux directly without the ugly ChromeOS developer-mode kludge (and if it had a 64GB SSD instead of 16GB...). Also, I do miss the T100TA's charge-from-microUSB (the C201 has a small 12V power supply, but charging via USB is better even if it is slow).
I'll probably use the T100TA more than the C201 -- not because it's slightly faster, but because I feel like I own it, whereas on the C201 I feel like I'm a guest of Google's (as a side note, apparently you can
install a fully native Debian, but I haven't gotten there yet. The fact that you have to use the kernel blob from ChromeOS makes me hesitate more, but one of these days I might give it a shot).
I've been working on a REAPER Linux port for a few years, on and off, but more intensely the last month or two. It's actually coming along nicely, and it's mostly a lot of fun (except for getting clipboard/drag-drop working, ugh that sucked ;). Reinventing the world can be fun, surprisingly.
I've also been a bit frustrated with Windows (that crazy defender/antispyware exploit comes to mind, but also one of
my Win10 laptops used to update when I didn't want it to, and now won't update when I do), so I decided to install
linux on my T100TA. This is a nice little tablet/laptop hybrid which I got for $200, weighs something like 2 pounds, has a quad core
Atom Bay Trail CPU, 64GB of MMC flash, 2GB of RAM, feels like a toy, and has a really outstanding battery life (8 hours
easily, doing compiling and whatnot). It's not especially fast, I will concede. Also, I cracked my screen, which
prevents me from using the multitouch, but other than that it still works well.
Anyway, linux isn't officially supported on this device, which boots via EFI, but following this guide worked on the first try, though I had to use the audio instructions from
here. I installed Ubuntu 17.04 x86_64.
I did all of the workarounds listed, and everything seemed to be working well (lack of suspend/hibernate is an obvious shortcoming, but it booted pretty fast), until the random filesystem errors started happening. I figured out that the errors were occurring on read; the most obvious way to test was to run:
debsums -c
which will check the md5sum for the various files installed by various packages. If I did this with the default configuration, I would get random files failing. Interestingly, I could md5sum huge files and get consistent (correct) results. Strange. So I decided to dig through the kernel driver source, for the first time in many many years.
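The idea behind the debsums check (hash each file as read from disk, compare against a known-good value) can be sketched in a few lines of C. This is only an illustration, not what debsums does internally: it uses FNV-1a instead of MD5 to stay dependency-free, and hash_file() and verify_file() are hypothetical names. Note that repeated reads of the same file may be served from the page cache rather than the device.

```c
#include <stdint.h>
#include <stdio.h>

/* FNV-1a 64-bit hash of a file's contents (a stand-in for MD5 here).
   Returns 0 on success, -1 on open/read error. */
int hash_file(const char *path, uint64_t *out)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    uint64_t h = 0xcbf29ce484222325ULL;      /* FNV offset basis */
    unsigned char buf[65536];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        for (size_t i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 0x100000001b3ULL;           /* FNV prime */
        }
    }
    int err = ferror(f);
    fclose(f);
    if (err) return -1;

    *out = h;
    return 0;
}

/* Returns 1 if the file hashes to `expected`, 0 if it does not,
   and -1 if it could not be read. */
int verify_file(const char *path, uint64_t expected)
{
    uint64_t h;
    if (hash_file(path, &h) != 0) return -1;
    return h == expected ? 1 : 0;
}
```

A mismatch on a file nobody has modified points at the read path (driver, controller, or flash) rather than at the file itself.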
Workaround 1: boot with:
sdhci.debug_quirks=96
This disables DMA/ADMA transfers, forcing all transfers to use PIO. This solved the problem completely, but lowered transfer rates to about (a very painful) 5MB/sec. This allowed me to (slowly) compile kernels for testing (which, using the stock Ubuntu kernel configuration, meant a few hours to compile the kernel and the tons and tons of drivers used by it, ouch. Also I forgot to turn off debug symbols so it was extra slow).
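As for where 96 comes from: it is the OR of two quirk bits. The sketch below decomposes the value; the two bit definitions are reproduced from memory of drivers/mmc/host/sdhci.h and should be checked against your own kernel tree, and pio_only_quirks() and print_quirk_bits() are hypothetical helper names.

```c
#include <stdio.h>

/* Values as recalled from drivers/mmc/host/sdhci.h (an assumption;
   verify against the kernel version you are building). */
#define SDHCI_QUIRK_BROKEN_DMA   (1u << 5)   /* unusable DMA engine  */
#define SDHCI_QUIRK_BROKEN_ADMA  (1u << 6)   /* unusable ADMA engine */

/* The debug_quirks value that marks both DMA engines as broken,
   leaving PIO as the only transfer mode: 32 + 64. */
unsigned int pio_only_quirks(void)
{
    return SDHCI_QUIRK_BROKEN_DMA | SDHCI_QUIRK_BROKEN_ADMA;
}

/* Print each bit that is set in a debug_quirks value. */
void print_quirk_bits(unsigned int quirks)
{
    for (unsigned int bit = 0; bit < 32; bit++) {
        if (quirks & (1u << bit))
            printf("bit %u (0x%x)\n", bit, 1u << bit);
    }
}
```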
I tried a lot of things, disabling various features, getting little bits of progress, but what finally ended up fixing it was totally simple. I'm not sure if it's the correct fix, but since I've added it I've done hours of testing and haven't had any failures, so I'm hoping it's good enough.
Workaround 2 (I was testing with 4.11.0):
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -2665,6 +2665,7 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
 				 */
 				host->data_early = 1;
 			} else {
+				mdelay(1); // TODO if (host->quirks2 & SDHCI_QUIRK2_SLEEP_AFTER_DMA)
 				sdhci_finish_data(host);
 			}
 		}
Delaying 1ms after each DMA transfer isn't ideal, but typically these transfers are 64k-256k, so it shouldn't cause too many performance issues (changing it to udelay(500) might be worth trying too, but I've recompiled kernel modules, regenerated the initrd, and rebooted way, way too many times these last few days). I still get reads of over 50MB/sec, which is fine for my uses.
To be added properly, it would need some logic in sdhci-acpi.c to detect the exact chipset/version (80860F14:01, not sure how to more uniquely identify it) and a new SDHCI_QUIRK2_SLEEP_AFTER_DMA flag in sdhci.h. I'm not sure this is really worth including in the kernel (or indeed if it is even applicable to other T100TAs out there), but if you're finding your disk corrupting on a Bay Trail SDHCI/MMC device, it might help!
After reading an article on Hacker News about distrokid.com, I realized I could put all of my many hundreds of hours of recorded music on Spotify, Apple Music, Google Play, Amazon, etc etc etc. For $20/year. I wouldn't expect to earn any money whatsoever for my recordings, but it would be entertaining.
So, as a result, step 1 of this process is complete -- which is to say that I put together an album from many of my recent Super8/REAPER-produced recordings. 18 of them, to be exact, totalling about 45 minutes of music, with words. Every song has words. As a result, I titled this album "Songs with Words".
This album, which is incredible in the 2017 sense only, is available via Spotify, Apple Music, Google Play, probably others too (search for my name in the respective service) -- and of course it is available for free in streamable/downloadable form via music.1014.org.
More albums will probably follow soon; one or two volumes of instrumentals will be next.
Recordings:
cory_andre_anette_aubrey - 1 -- [3:12]
cory_andre_anette_aubrey - 2 -- [4:01]
cory_andre_anette_aubrey - 3 -- [5:43]
cory_andre_anette_aubrey - 4 -- [6:39]
cory_andre_anette_aubrey - 5 -- [4:05]
cory_andre_anette_aubrey - 6 -- [5:19]
cory_andre_anette_aubrey - 7 -- [4:34]
cory_andre_anette_aubrey - 8 -- [5:16]
cory_andre_anette_aubrey - 9 -- [5:35]
cory_andre_anette_aubrey - 10 -- [4:42]
cory_andre_anette_aubrey - 11 -- [3:13]
cory_andre_anette_aubrey - 12 -- [5:21]
cory_andre_anette_aubrey - 13 -- [5:01]
cory_andre_anette_aubrey - 14 -- [2:59]
cory_andre_anette_aubrey - 15 -- [4:09]
cory_andre_anette_aubrey - 16 -- [6:17]
cory_andre_anette_aubrey - 17 -- [7:37]
cory_andre_anette_aubrey - 18 -- [13:45]
TL;DR: Retina iMac (4k/5k) owners can greatly improve the graphics performance of many applications (including REAPER) by setting the color profile (in System Preferences, Displays, Color tab) to "Generic RGB" or "Adobe RGB" (and restarting REAPER and/or other applications being tested).
I previously wrote in mid-2014 about the state of blitting bitmaps to screen on modern OS X (now macOS) versions. Since then, Apple has released new hardware (including Retina iMacs) and a couple of new macOS versions.
Much of that article is still useful today, but I made a mistake in the second update:
OK, if you provide a bitmap that is twice the size of the drawing rect, you can avoid argb32_image_mark_RGBXX, and get the Retina display to update in about 5-7ms, which is a good improvement (but by no means impressive, given how powerful this machine is). I made a very simple software scaler (that turns each pixel into 4), and it uses very little CPU.
While this was helpful (and did decrease the amount of time spent blitting), it was wrong in that the reason for the faster blit was that the system was parallelizing the blit with multiple cores. So, it was faster, but it also used more CPU (and was generally wasteful).
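For reference, a software scaler of the kind mentioned in the quote (each source pixel becomes a 2x2 block) could look roughly like the sketch below. This is an illustration and not the actual WDL/REAPER code; scale2x() is a hypothetical name, and it assumes tightly packed 32-bit pixels with no row padding.

```c
#include <stddef.h>
#include <stdint.h>

/* Nearest-neighbor 2x upscale: every source pixel is written to a 2x2
   block in the destination. src holds w*h pixels; dst must have room
   for (2*w)*(2*h) pixels. Both buffers are assumed tightly packed. */
void scale2x(const uint32_t *src, uint32_t *dst, size_t w, size_t h)
{
    const size_t dw = 2 * w;                   /* destination row length */
    for (size_t y = 0; y < h; y++) {
        const uint32_t *in = src + y * w;
        uint32_t *out0 = dst + (2 * y) * dw;   /* first output row  */
        uint32_t *out1 = out0 + dw;            /* second output row */
        for (size_t x = 0; x < w; x++) {
            const uint32_t p = in[x];
            out0[2 * x] = out0[2 * x + 1] = p;
            out1[2 * x] = out1[2 * x + 1] = p;
        }
    }
}
```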
I discovered this because I've been researching how to improve REAPER's graphic performance on the iMac 5k in particular, so I started benchmarking. This time around, I figured I should measure how many screen pixels are updated and divide that by how long it takes. Some results, based on my memory (I'm not going to rerun them for this article, laziness).
Initial version (REAPER 5.32 state, using the retina hack described above, public WDL as of today):
- old C2D iMac, 10.6: 350MPix/sec
- mid-2012 RMBP 15", 10.12, Thunderbolt display (non-retina): 1500MPix/sec
- mid-2012 RMBP 15", 10.12, built-in display (retina): 800MPix/sec
- late-2015 Retina iMac 5k, 10.12: 192MPix/sec
The one that really jumped out at me was the Retina iMac 5k -- it's a quarter of the speed of the RMBP! WTF. We'll get to that later.
After I realized the hack above was actually doing more work (thank you, Xcode instrumentation), I did some more experiments, avoiding the hack, and found that in the newer SDKs there are kCGImageByteOrderXYZ flags (I don't believe they were in previous SDKs), that these are aliased to kCGBitmapByteOrderXYZ, and that using kCGBitmapByteOrder32Host with the pixel format for CGImageCreate()/etc would speed things up.
With retina hack removed:
- mid-2012 RMBP 15", 10.12, built-in display (retina): 300MPix/sec
- late-2015 Retina iMac 5k, 10.12: 152MPix/sec
With retina hack removed and byte order set to host:
- old C2D iMac, 10.6: 350MPix/sec
- mid-2012 RMBP 15", 10.12, Thunderbolt display (non-retina): 1500MPix/sec
- mid-2012 RMBP 15", 10.12, built-in display (retina): 720MPix/sec
- late-2015 Retina iMac 5k, 10.12: 200MPix/sec
The non-retina displays might have changed slightly, but it was insignificant. So, by setting the byte order to native, we get the Retina MBP close to the level of performance of the hack, which isn't great but is serviceable, and at least the CPU use is decreased. This also has the benefit (drawback?) of making the byte-order of pixels the same on macOS/Intel and win32, which will take some more attention (and a lot of testing).
From profiling and looking at the code, this blit performance could easily be improved by Apple -- the inner loop where most time is being spent does a lot more than it needs to. Come on Apple, make us happy. Details offered on request.
Of course, this really doesn't do anything for the iMac 5k -- 200MPix/sec is *TERRIBLE*. The full screen is 15 megapixels, so at most that gets you around 13fps, and that's at 100% CPU use. After some more profiling, I found that the function chewing the most CPU ended in "64". Then it hit me -- was this display running in 16 bits per channel? A quick Google search later, it was clear: the Retina iMacs have 10-bit displays, and you can run them in 10 bits per channel, which apparently means 16 bits per channel in memory, or 64 bits per pixel. macOS is converting all of our pixels to 64 bits per pixel (I should also mention that it seems to be doing a very slow job of it). Luckily, changing the color profile (in System Preferences, Displays) to "Generic RGB" or similar disables this, and it gets the ~800MPix/sec level of performance similar to the RMBP, which is at least tolerable.
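The frame-rate ceiling above is just throughput divided by screen size. A small sketch of that arithmetic follows; max_fullscreen_fps() is a hypothetical helper, and the 5120x2880 panel resolution used in the comments is an assumption filled in here (it matches the "15 megapixels" figure, at about 14.7 million pixels).

```c
/* Upper bound on full-screen redraws per second for a given blit
   throughput (in megapixels per second) and screen size in pixels. */
double max_fullscreen_fps(double throughput_mpix, unsigned width, unsigned height)
{
    const double pixels = (double)width * (double)height;
    return pixels > 0.0 ? throughput_mpix * 1e6 / pixels : 0.0;
}

/* Assuming a 5120x2880 panel:
   max_fullscreen_fps(200.0, 5120, 2880) is about 13.6
   max_fullscreen_fps(800.0, 5120, 2880) is about 54.3 */
```

So even at the ~800MPix/sec level, a full-screen redraw on this panel stays just under 60fps.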
Sorry for the long wordy mess above, I'm posting it here so that Google finds it and anybody looking into why their software is slow on macOS 10.11 or 10.12 on Retina iMacs has some explanation.
Also please please please Apple optimize CGContextDrawImage()! I'm drawing an image with no alpha channel and no interpolation and no blend mode and the inner loop is checking each pixel to see if the alpha is 255? I mean wtf. You can do better. Hell, you've done way better. All that "new" Retina code needs optimizing!
Update a few hours later:
While fixing various issues with the updated byte ordering, I found that CoreText produces quite different output for CGBitmapContexts created with different byte orderings:
Hmph! Not sure which one is "correct" there... hmm... If you use kCGImageAlphaPremultipliedFirst for the CGBitmapContext rather than kCGImageAlphaNoneSkipFirst, then it looks closer to the original, maybe?
One other caveat: NSBitmapImageRep can't seem to deal with the ARGB format either, so if you use that you need to manually bswap the pixels...
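The manual byte swap is simple enough. Here is a minimal, portable sketch; the function names are hypothetical, and in practice a compiler builtin or an OS-provided swap routine would do the same job.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit pixel, which turns ARGB into
   BGRA (and back) as laid out in memory. */
uint32_t swap_bytes32(uint32_t v)
{
    return (v >> 24) |
           ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) |
           (v << 24);
}

/* Swap every pixel of a buffer in place. */
void swap_pixel_bytes(uint32_t *pixels, size_t count)
{
    for (size_t i = 0; i < count; i++)
        pixels[i] = swap_bytes32(pixels[i]);
}
```

Applying the swap twice restores the original buffer, so the same routine works in both directions.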
Update (2019): Worked around most of this issue by using Metal, read here.