Official Blog

This blog is a compilation of thoughts I have had over the course of the past few months, based on and triggered by newspaper articles as well as comments from customers and partners. The vast majority of the blog will be about HPC and what is not working as well as it could. I'll include news and my opinions on suppliers to the HPC industry, industry outlooks, tech trends and economic projections if I feel comfortable with those. From time to time, I might bring up things within the broader framework of IT that annoy me, and of course I'll include news and my comments on IT security, privacy and authentication when appropriate.

I will try my best to keep it current and relevant.

FMS renames itself

Posted on by: Axel Kloth

The Flash Memory Summit, which for 18 years has presented the state of the art and advances in Flash Memory, has decided to rename itself to properly reflect its extended scope beyond Flash. Expect to see coverage of DRAM, Flash, Phase Change Memory and other novel main memory technologies, as well as new and updated mass storage technologies. While the acronym remains FMS, it now stands for The Future of Memory and Storage.

The old FMS URL still works and will be kept for a while, but FutureMemoryStorage.com is the new FMS URL.

Make sure to sign up to see the exciting new developments in The Future of Memory and Storage.

RISC-V Summit 2023

Posted on by: Axel Kloth

This year's RISC-V Summit at the Santa Clara Convention Center in California proved that RISC-V is not only here to stay, it is growing. The ecosystem around it is expanding at a nice pace as well. There are more companies offering RISC-V cores, processors, boards, computers, handhelds, laptops and even servers than last year. More compilers, debuggers and other tools and operating systems as well as hypervisors are available, and tools for the verification of ISA compliance as well as the CPU design itself are becoming mainstream.

What was surprising to me was that there was a good number of end users who showed interest, and that is what usually starts the commercial success of a platform and an ISA. While x86-64 is not going to go away any time soon with its installed base of software that can't easily be replaced, ARM is being pushed aside by RISC-V in several areas. IOT and IIOT are probably the biggest areas in which RISC-V has the potential to completely displace ARM, as neither x86-64 nor ARM have developed a stronghold there yet. Feature phones and possibly tablets might be next to adopt RISC-V and threaten ARM. The OSes now exist, and with Android on RISC-V a reality, cost-sensitive feature phones might switch to the RISC-V platform entirely. A preview of all of this was available at the Summit.

It may have been coincidence, but Silicon Angle reported that Arm’s stock sinks on lower guidance following first post-IPO earnings call. I had posted a blog entry not too long ago explaining why I thought the ARM IPO was overpriced and overhyped. Reality caught up very quickly with ARM and its leadership. To some degree, RISC-V has already impacted ARM and its stock price.

Large Language Models

Posted on by: Axel Kloth

As the name "Large Language Model" implies, generating an LLM requires large input data to feed into a Generative Pre-trained Transformer. The generation of an AI model for a LLM in the backend is a computationally hard and memory-intensive application, both in terms of bandwidth and latency requirements. CPUs do not have enough cores to effectively execute those functions, and GPGUS are neither general-purpose nor do they have an efficient and effective way to directly communicate with each other. As such, the direct interconnect is missing, and another portion that is missing is a scale-out port to connect many of them together at maximum bandwidth and minimum latency. These limitations are known as the memory wall, the von-Neumann bottleneck and the Harvard architecture limits.

We overcome all of them. That technology is now patent-pending.

Beyond Harvard CPU Architecture

Posted on by: Axel Kloth

For decades now, people have complained about the von-Neumann bottleneck (input – processing – output, with some instruction and temporary data I/O to and from memory). The only suggested solution was the Harvard architecture, which separates instruction and data I/O to and from memory. No progress has been made since.

We have developed a post-Harvard CPU architecture that does away with the scale-out limitations imposed by current processors and accelerators, including GPGPUs.

Our solution adds, among other items, a scale-out port to the CPU and accelerator core that allows direct connectivity between general-purpose CPU cores, accelerator cores, smart multi-homed memories, and secondary infrastructure such as peripherals. This allows us to separate I/O from peripherals and all of the aforementioned from any kind of Inter Processor Communication (IPC), and to optimize all communication channels accordingly.

Securing a Server

Posted on by: Axel Kloth

I must have poked into a hornet's nest with my two recent blog posts on BMCs and on DPUs. I got a whole bunch of angry emails in response. That is a good thing.

In short, neither a DPU nor a BMC alone can secure a server. It takes vastly more than that, and most importantly, the firmware and BIOS of both the server's host CPU and the BMC must be fully secured, encrypted and authenticated.

Let me recap how to secure a server these days. Unlike a decade ago, a server today contains a whole bunch of smart devices that have their own processors and boot code and BIOSes. As a result, all of them must be secured individually to secure the server as a whole. Here is an incomplete list: CPU(s); DPU(s); SAS, SATA and RAID Controllers if present; all accelerators including GPGPUs; the authentication processor in the Root of Trust coprocessor if present; the TPM or vTPM and of course the BMC. All of these have their own firmware and BIOS, so those must be protected against insertion of malicious code, and that only works if the code is encrypted to individual keys, not a manufacturer's key. In case of detected tampering, the device must not start up. Anything in the net user data path is critical and should be protected against the threats that usually affect data contents. The situation is different and even more crucial for the BMC, which is not in the net user data path, but access to it opens up literally everything in the server to an intruder. As such, the BMC must be secured as tightly as possible. Since the BMC is not in the net user data path, it can only protect itself and the OAM&P path of the boot CPU, nothing else. Those two items must never be confused.
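
To make the "every component must be checked before it may start" point concrete, here is a minimal Python sketch. It uses simple hash measurements against a golden manifest rather than the per-device encryption described above, and the component names, images and digests are invented for illustration; a real platform would keep the golden values and the check itself inside a Root of Trust or TPM, not in host software.

    # Minimal sketch: measure each component's firmware and compare it against
    # known-good values before allowing the component to start. Everything here
    # (names, images, the tampering) is synthetic and purely illustrative.
    import hashlib

    def measure(image: bytes) -> str:
        return hashlib.sha384(image).hexdigest()

    # Stand-in firmware images for some of the devices listed above.
    firmware = {
        "host_cpu_bios": b"\x55\xaa" * 4096,
        "bmc":           b"\x12\x34" * 4096,
        "dpu":           b"\xde\xad" * 4096,
        "raid_ctrl":     b"\xbe\xef" * 4096,
    }

    # Golden manifest, recorded when the known-good firmware was provisioned.
    golden = {name: measure(image) for name, image in firmware.items()}

    # Simulate tampering with one component's firmware.
    firmware["bmc"] = firmware["bmc"][:-2] + b"\x00\x00"

    for name, image in firmware.items():
        ok = measure(image) == golden[name]
        print(f"{name:>14}: {'ok' if ok else 'TAMPERED - do not start'}")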

ARM IPO

Posted on by: Axel Kloth

ARM is having an Initial Public Offering (again). This time around I am even less enthusiastic about it than I was at its first IPO. Back then, ARM was still Acorn RISC Machines. It was a novel RISC Instruction Set Architecture for low-performance devices (Apple tried it in the Newton MessagePad 700/710), and the ISA was somewhat acceptable. Nothing great, nothing terrible, but certainly not built to make any of the parts that make up a processor as simple as possible. The simpler and more straightforward the ISA, the simpler the instruction decoder and pipeline, and all subsequent units. That was not ARM, but it was acceptable. Today, the ARM ISA has become too big. A big and complex ISA makes every unit in the processor overly complex, power-hungry, prone to attacks and less conducive to performance improvements.

Another drawback is that licensing anything from ARM has become too expensive. There is a good chance that even if you license only the core and do not need anything else, ARM will push you to take whatever else is in its portfolio on top of the processor core IP. So even if you have other ASIC components that were developed in-house and perform better in the application your customer targets than anything in ARM's portfolio, ARM might still force you to license and use its IP.

Also, if the performance of the ARM core is not what you need, you cannot touch it unless you have a very expensive architecture license. And even then, ARM may sue you if you have licensed versions A through D, and you buy a company with a valid license to E, and you now incorporate version E into your products...

I do not like ARM's corporate leadership, and neither do I like its current owner, SoftBank, and its Vision Fund.

I will absolutely stick with RISC-V for all of those reasons. I can do whatever I want and implement it any way I want, and for as long as I can show that I am in compliance with the RISC-V ISA, my implementation can be certified.

I will also not invest in ARM, and I will make sure that none of my portfolios buy into ARM's IPO. Even if Apple, Intel, Qualcomm and TSMC invest in ARM's IPO, I will not touch it.

The OpenROAD Project

Posted on by: Axel Kloth

I have had a lengthy discussion with the leadership team at Precision Innovations with regard to The OpenROAD Project, and I came away incredibly impressed. Not only are they providing an entire open-source EDA toolset and related libraries for most planar-transistor semiconductor design efforts, they are clearly looking into the future of ASIC design. While the commercial tools may be able to cover more processes down to the FinFET and Gate-All-Around nodes, the OpenROAD team has identified many issues that the average ASIC design engineer faces, and has set out to solve them. Multi-die and MCM are already included in the flows, and integration of analog and mixed-signal into digital logic design is straightforward. What impressed me even more is the willingness and ability of the team and its engineers to react quickly to suggestions. We will use that toolset for all of our proof-of-concept designs, which will also allow us to compartmentalize our designs and assist in better design verification.

Definition of a DPU

Posted on by: Axel Kloth

I have received a number of emails indicating that the function of a DPU is not entirely clear. Here is a brief overview of what a DPU does, and why it is of importance. DPU is an acronym for Data Processing Unit, but that is very generic and does not say much. Generally, a DPU is a smart Network Interface Card (NIC) that allows certain traffic to be offloaded from the CPU. In the early days, smart NICs offloaded only some TCP processing from the server CPU, while all UDP traffic still needed to be processed by the server CPU. That proved not to be very useful, and as such smart NICs never really took off. That changed when the offload was augmented by allowing the processor on the smart NIC to autonomously transfer data from the host CPU's memory to the recipient's smart NIC, which in turn transferred the data to the receiving side's CPU's memory. That reduced the server CPU's task to directing the DPU to transfer data from region A in its memory to region B in a destination server's memory. In other words, the server CPU had to send only a very short command with very few parameters to the DPU, and the DPU would then autonomously execute the transfer without further input from the CPU. This is called DMA and, when the initiator requests data from the other side, RDMA. Once smart NICs had evolved into these feature-rich devices, they became useful and indeed offloaded the host CPU. With the cost of a DPU quickly dropping, the additional computational and data transfer capabilities make technical and financial sense. On top of that, most DPUs have their own memory and Operating System, and they can be used to filter data such that inbound threats are recognized and terminated. The threat databases can be automatically updated and synchronized, and even AI can be used to identify new threat patterns, all without increasing the computational load on the host CPU. That is only possible because the DPU is in the net user data path.
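
The "very short command with very few parameters" can be pictured as a small descriptor that the host CPU writes and the DPU then executes on its own. The Python sketch below is purely illustrative; the field names, sizes and opcodes are invented for this example and do not correspond to any specific DPU, NIC or RDMA verbs API.

    # Illustration only: the kind of short descriptor a host CPU might hand to
    # a DPU to kick off an (R)DMA transfer. Field names and sizes are invented
    # for this sketch.
    from dataclasses import dataclass
    import struct

    @dataclass
    class DmaDescriptor:
        opcode: int        # e.g. 0 = local DMA write, 1 = RDMA read from remote
        src_addr: int      # address of region A in local memory
        dst_addr: int      # address of region B in the remote server's memory
        length: int        # number of bytes to move
        remote_node: int   # fabric/NIC identifier of the destination server
        key: int           # memory protection key for the remote region

        def pack(self) -> bytes:
            # A command of a few tens of bytes is all the host CPU produces;
            # the DPU executes the transfer autonomously from here on.
            return struct.pack("<BQQIIQ", self.opcode, self.src_addr,
                               self.dst_addr, self.length, self.remote_node,
                               self.key)

    cmd = DmaDescriptor(opcode=1, src_addr=0x1000_0000, dst_addr=0x2000_0000,
                        length=1 << 20, remote_node=42, key=0xDEADBEEF)
    print(len(cmd.pack()), "bytes of command for a 1 MiB transfer")

The point is that a few tens of bytes of command are enough to move megabytes or gigabytes of data, which is what makes the offload worthwhile.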

All major CPU vendors are looking at DPUs, and right now NVIDIA with its acquisition of Mellanox and AMD with its Pensando acquisition have DPUs for sale or are close to offering them. All current DPUs are PCIe-attached and as such are limited in their performance by the bandwidth available on PCIe and its high latency. CXL will not change that as CXL uses PCIe as the underlying infrastructure.

Our Server-on-a-Chip has a built-in DPU as well. Our DPU uses as much hardware offload (DMA Controllers, cryptographic accelerators, authentication engines, and a key management unit) as possible, and all functions that must remain on a programmable CPU core run on a RISC-V core set with hardware support for virtualization, using a highly modified version of OPNsense. Since our DPU is on-die on the Server-on-a-Chip, the PCIe limitations do not apply.

Flash Memory Summit 2023

Posted on by: Axel Kloth

This year's Flash Memory Summit at the Santa Clara Convention Center was again an in-person event, like last year. It ran from 2023-08-08 to 2023-08-10, and in my opinion, it was the most well-executed FMS ever. Kudos to the organizers. There were some obvious trends:

  • AI is picking up steam
  • Because AI – particularly generative AI during training – needs so much memory, memory size matters
  • CXL is coming. I don't like it for a variety of reasons, but everyone else thinks it is the second-hottest trend after AI (and due to AI)
  • ccNUMA may be back because of AI
  • Unified memory (DRAM and mass storage may converge) – I disagree with that, but that is what I heard many times
  • SATA and SAS spinning hard disks are dead. Long live the dead. WD and Seagate still make spinning disks
  • SATA and SAS SSDs are dead as they migrate to faster M.2 and M.3 and other PCIe-attached interfaces
  • Spinning disks are not really dead as they are relegated to cold or warm storage, instead of tape
  • Tape is not dead either. Mass storage is tiered
  • Moore's Law is dead, at least in 2D. However, while there is no real 3D chipmaking yet, 2.5D helps out. All Flash manufacturers can make Flash stacks with over 200 layers. Moore's Law is alive and kicking. I have said that in my blog for the past few years
  • We generate too much data. Humans do, and now AI will add to it, and more than ever, meaning that we either sift through it, or throw it away, or store it until we can sort through it. Which means that we will need all storage that anyone can make
  • Optane is still dead, but the memory that is PCM and was not at the same time – sort of like Schroedinger's cat – may be dead and alive at the same time, simply because Flash is not fast enough, and the endurance is still only a maximum of 3000 to 5000 cycles written per cell. Intel wrote Optane PCM off, and SK hynix has renamed the former Intel SSD business to Solidigm, and while Solidigm is now selling Flash, it's clear that they are looking at alternatives to Flash
  • Next year, FMS will be even better. Maybe it will even change its name, as it certainly is not only about Flash any more.

    Baseboard Management Controllers (BMCs)

    Posted on by: Axel Kloth

    Most servers have an integrated Baseboard Management Controller (BMC) on the mainboard. Its primary function is to assist in managing the server. This management includes updating the Firmware of the server, shutting it down and restarting it remotely, and interfacing with the Trusted Platform Module (TPM) and the Root of Trust coprocessor. As such, the BMC oftentimes has its own Network Interface Card (NIC) so that its function is uninhibited by the NIC of the server. In other words, even if the server has crashed and its Operating System is non-responsive, the BMC will save the day and be able to help the system administrator restart the server remotely. It is therefore imperative that the BMC itself is very well secured against attacks, and that its NIC is independent of the server's NICs. The connection to the TPM allows it to use shared secrets so that Firmware updates can be authenticated. It may even contain a virtual TPM, in which case a physical TPM is not needed. The same is true for the Root of Trust coprocessor. However, the BMC is connected through PCIe to the server's host processor, and that connection can be snooped on. Also, the host will boot from its own SPI Flash, independent of whether the BMC has access to it to update the Firmware. In other words, the BMC cannot guarantee the validity and the authenticity of the host's Firmware. Rewriting the Firmware with a hardware SPI Flash tool or using malicious Software to install malicious Firmware remains possible. Preventing this is beyond the control of the BMC. On top of that, for the reasons mentioned above, the BMC's NIC is not in the data path of the host processor, and as such the BMC cannot inspect the traffic on the host's NICs. A Data Processing Unit (a fancy term for a smart offload NIC with DMA and RDMA capabilities) can do that, and as such would be able to identify and block malicious traffic on the network, in both ingress and egress directions. A BMC cannot do that.

    The BMC is as decoupled as possible from the net user data, and most decent system administrators even put the net-user-data-facing NICs on a different VLAN from any other devices that are used for operation, administration, maintenance and provisioning (OAM&P). If all net user data is on VLAN1, and all OAM&P traffic is on VLAN2, then even someone on the LAN – including an intruder – will not see the BMC.

    AI will need TB-level DRAM

    Posted on by: Axel Kloth

    Like pretty much everyone else, I have played with ChatGPT and a few other AI tools, and the more I did, the more I recognized that the convergence of Hardware requirements for AI, ML and traditional HPC is upon us. HPC has always needed very large amounts of main memory, but it surprised me to see that both the training and the inference side of AI take considerable amounts of DRAM.

    Any kind of model creation maxed out our engineering server – and that is not a small machine: it has 64 cores and 512 GB of DRAM. On the inference side the load was a bit lighter, but assuming that if something works, the public will start using it in large numbers, I can foresee that a 16 GB laptop will not do in the future. I think that DRAM in a laptop will soon have to be 64 GB, and any servers on the training side will have to be TB-level DRAM machines. I would not be surprised if two or three years from now, we see 4 TB servers on the lower end, and larger ones with 16 to 64 TB worth of DRAM in them. Considering that a lot of power is needed to run the DRAM protocol – SSTL-2 – that might be the next barrier to bring down.
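
    A rough sizing sketch in Python shows why TB-level DRAM on the training side is plausible. The rule of thumb used here - roughly 16 to 20 bytes per parameter for weights, gradients and optimizer state in mixed-precision training, before activations are counted - is a common estimate, not a measurement, and the model sizes are picked purely for illustration.

        # Rough sizing sketch (rule-of-thumb numbers, not vendor data): memory
        # needed just to hold a model during mixed-precision training with an
        # Adam-style optimizer, before activations and framework overhead.

        def training_memory_tb(params_billion, bytes_per_param=18):
            return params_billion * 1e9 * bytes_per_param / 1e12

        for size in (7, 70, 180):   # illustrative model sizes, in billions
            print(f"{size:>4}B parameters -> ~{training_memory_tb(size):.1f} TB "
                  f"before activations")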

    ERC Fraud

    Posted on by: Axel Kloth

    I have a hunch that the next big revelation is going to be that there is a lot of fraudulent Employee Retention Credit activity going on. I keep receiving emails and calls alerting me to the availability of the ERC for my company. One of those service providers even went so far as to say that according to the BBB (since when does the BBB have the authority and mandate to track that?), we had two employees during the ERC time frame at a salary level that would qualify for the full $26K of ERC per employee, and that he had prepared it all and I just needed to sign the document. He'd send it off and then, upon the refund, would collect 15% of the loot from us. There are two problems with that: my company did not have employees during that period, and I am not sure what would happen if I signed, sent it off and received funds. It could very well be that I'd end up in prison for fraud, and that he'd be laughing from the outside, as I would have defrauded the government while the contract between him and me is merely a civil matter. Needless to say, I did not sign, and I blocked him on the phone, by email and otherwise.

    I am pretty sure that two or three years down the road we’ll see a lot of innocent but naive business owners fighting in court to stay out of prison, while the criminal ERC service providers laugh their butts off all the way to the bank.

    Breach of the MSI/Intel Firmware Signing Keys

    Posted on by: Axel Kloth

    A few weeks ago, on April 7th, 2023, news broke that someone had broken into MSI's servers, and that among the things they stole were the MSI Firmware signing keys. That is not quite correct. The keys they stole were the Intel Firmware signing keys, and that is an indication of a complete misunderstanding of security through asymmetric keys. Security can be achieved by using symmetric or asymmetric keys, and using asymmetric keys implies that there is one key to encrypt (or sign) something, and another key to decrypt (or verify) it. The encryption/decryption (or signing/verification) operation requires a matched key pair. One key is called the public key because it is public on purpose, so that everyone can decrypt (or verify) something that you have encrypted (or signed) to protect it against impostors. In other words, someone uses a private key to encrypt (or sign) a document or a piece of Firmware or Software so that no one other than the key holder can produce an item that can be decrypted (or verified) by everyone using the public key. If someone encrypts (or signs) something with the wrong private key, your valid public key will not be able to decrypt it or verify the validity of the signature. This is particularly important for something as fundamental and basic as the BIOS or UEFI Firmware for your computer. You want to make sure that you do not install some malicious Firmware that someone has messed with, and so you need to rely on the secrecy of the private key used to sign the Firmware. As the holder of a private key, you must make sure that this key never gets distributed and never leaves the house or room.
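
    For readers who want to see the mechanics, here is a minimal, self-contained Python illustration of the sign-and-verify flow, using the widely available cryptography package. The key pair and the firmware image are generated on the fly as placeholders; this is obviously not how Intel, MSI or any BIOS vendor packages firmware. It only demonstrates that signing requires the private key, and that a single flipped bit makes verification against the public key fail.

        # Minimal sign/verify illustration with the Python "cryptography"
        # package. Keys and image are placeholders generated on the fly.
        from cryptography.hazmat.primitives import hashes
        from cryptography.hazmat.primitives.asymmetric import rsa, padding
        from cryptography.exceptions import InvalidSignature

        private_key = rsa.generate_private_key(public_exponent=65537, key_size=3072)
        public_key = private_key.public_key()

        firmware = b"\x7fUEFI-image-placeholder" * 1000   # stand-in for a BIOS image
        signature = private_key.sign(firmware, padding.PKCS1v15(), hashes.SHA256())

        def verify(image, sig):
            try:
                public_key.verify(sig, image, padding.PKCS1v15(), hashes.SHA256())
                return "valid"
            except InvalidSignature:
                return "REJECTED"

        print("untouched image:", verify(firmware, signature))
        tampered = bytearray(firmware)
        tampered[100] ^= 0x01                              # flip a single bit
        print("tampered image: ", verify(bytes(tampered), signature))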

    Well, Intel got that wrong and distributed the private Firmware signing keys to every manufacturer of Intel-based computers. MSI was one of them. Predictably, they got breached. MSI confirms security breach following ransomware attack claims. This is about the worst security breach possible, aside from the SolarWinds debacle. Anyone in possession of this Firmware signing key can now write a malicious version of the BIOS or UEFI for any Intel-based computer, distribute it and be certain that the user installs it thinking that it is legitimate.

    In other words, this breach has legitimized the illegitimate. The Police have become the bankrobbers, and the bankrobbers are the Police. It is very difficult to get out of this conundrum. All users must now use a different way of authenticating a Firmware update, and once that is done, the decryption keys are hopefully changed such that after a successful BIOS or UEFI Firmware update, the old key pair is retired, and all subsequent Firmware updates can go back to normal. However, this is the best case outcome. In reality, a good portion of users do not update their Firmware routinely, and only do so if and when something does not work any more. In that case, they will not be able to install the new legitimate Firmware as the keys have changed, and they will be greeted with an “Invalid Firmware” message.

    It is going to be up to Intel and MSI to fix that. How much trust can we put into that? It is not that it could not have been predicted. In fact, I wrote patents that were intended to avoid exactly this situation, and in the intro, I pointed out the vulnerabilities of the current way to deal with unauthenticated Firmware.

    The big question arising from this debacle is of course whether a better Trusted Platform Module (TPM, or a virtual version, vTPM) or a Root of Trust (RoT) coprocessor or a smarter and more secure BMC could have prevented this. The answer to that question is an unequivocal No. The problem here is that the Firmware signing key was compromised, and all of the above measures rely on a valid and legitimate signing key. The only remedy would have been a very different way for a processor to boot from its secure and encrypted Flash that is not predicated on a Firmware signing key. That method is described in one of my patents, and we are building a many-core processor that implements it. In fact, the newest version we are implementing in our Server-on-a-Chip is even more secure, with additional safeguards for authenticity.

    Twitter and NPR in a spat

    Posted on by: Axel Kloth

    Elon Musk had incorrectly tagged NPR as "government funded media". As a result, NPR decided to leave Twitter. Link to a Politico article here: NPR leaves Twitter. While that in and of itself is pretty bad, Musk then threatened to reassign the handle @NPR to another company.

    The Hill expands on this here: Twitter threatens to reassign @NPR handle. I hope that NPR does not back down and in fact sues Twitter and Musk over this. Why? Because the social media companies have effectively established a thriving parallel system to the USPTO and WIPO for trademarks.

    A handle is comparable to a trademark, and as such it should be under the purview of the national and international patent and trademark agreements. NPR spent a lot of time and money to establish its brand, and that brand is NPR. I imagine that Elon Musk would not be happy if someone on Mastodon claimed @Tesla and @ElonMusk as theirs. Imagine the damage to the brand NPR if Musk reassigns it to the National Pumice Recyclers (I checked, they don't exist).

    Why is this important? The patent and trademark system exists to protect intellectual property, and a trademark and name and handle belong in this category. This should not be up to private social media companies. IP and handles that are comparable to trademarks should be handled by the organizations that were set up to protect them.

    The big CXL Conundrum

    Posted on by: Axel Kloth

    It is interesting to see that everyone agrees that both hyperscalers and supercomputers today rely on an outdated architecture for processors, accelerators and memory that does not seem to work well. It is even more interesting to see that the suggested solutions don't solve the problem, but create new ones or exacerbate old ones. One of these is CXL, the Compute Express Link. In short, CXL is a secondary protocol over the PCIe infrastructure. It is intended to allow memory - in most cases that will be DRAM - to be disaggregated from the server and its processors. PCIe is a high-latency infrastructure, and as such is not suited to memory attachment. The argument is that DRAM is expensive and should be a shared resource across servers and processors. On April 19, 2023, Micha Risling, the co-founder of UnifabriX, stated where he thinks CXL fits into the future of memory in the data center. The article CXL is Ready to Reshape the World’s Data Centers even mentions the latency problem, only to go on to ignore it entirely.

    If CXL is used to replace non-shared SATA Flash, then it can be made to work, as CXL still has lower latency than a SATA- or SAS-attached disk, even an SSD. The problem arises when that CXL-attached memory is shared. If it is shared, then a coherency mechanism must be present to ensure that shared data has not been invalidated by a prior write access coming from a different processor. To ensure coherency, mechanisms such as MESI, MOESI and directory-based approaches exist, but all of them rely on a lookup for validity first. In other words, before a data set is read, a read access to a directory or the MESI/MOESI bits for that data set is needed to check if it is still valid, or if it has become invalid due to a modification from another processor which had fetched that data set but had not yet had the time to write back the modified data. If the data is still valid, then a read access can be executed while locking that data set copy in the shared CXL-attached memory against other accesses from other processors. Obviously, the more processors (including the many cores in current processors) have access to this shared memory, the higher the percentage of time during which the data set is not accessible, invalid or locked. Since CXL is such a high-latency infrastructure, the metadata traffic and the lockout times due to the long round-trip times will be a significant portion of the memory access times, and the usefulness of CXL-attached memory will be greatly diminished. In other words, sharing memory over a high-latency infrastructure such as CXL does not solve the problem; it will instead create new ones. The problem is exacerbated even further if the memory is shared in an appliance that contains internal CXL switches.
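
    A toy latency model in Python makes the argument tangible. The numbers below are assumptions picked only to illustrate the ratio; actual latencies vary widely by platform, switch depth and load.

        # Toy model with assumed (illustrative) latencies: the effective time to
        # read a shared cache line when a coherency lookup must precede the
        # data fetch over a high-latency fabric.

        local_dram_ns = 90     # assumed local DRAM access
        fabric_rtt_ns = 400    # assumed round trip across the CXL/PCIe fabric and a switch
        directory_ns  = 50     # assumed directory/MESI state lookup at the home node

        local_read      = local_dram_ns
        shared_cxl_read = fabric_rtt_ns + directory_ns + fabric_rtt_ns
        #                 ^ validity lookup round trip  ^ the actual data round trip

        print(f"local DRAM read:           ~{local_read} ns")
        print(f"shared CXL read w/ lookup: ~{shared_cxl_read} ns "
              f"({shared_cxl_read / local_read:.0f}x slower, before any lock contention)")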

    In other words, CXL is another protocol on top of PCIe and as such has the same latency problems. CXL effectively can only carry non-coherent memory traffic.

    Why is latency such an issue? I am going to simplify the situation a bit, but in essence this is what happens if a CPU (or more precisely, the processor core and its L1/L2/L3 cache) cannot access memory contents that it needs to continue working: it needs to stall, switch tasks or go to sleep. In any of these cases, no work gets done, unless a task switch is possible with the current context saved to cache and the context of another thread retrieved from cache without using DRAM. All data fetches cost energy, but task switches by themselves do not execute user code. The CPU can only continue to execute user code if in fact a context switch is possible with valid data already present in one of its caches. The higher the latency to and from DRAM, the larger the caches have to be, and the more hierarchies of caches have to be present in a processor. Large caches with their TCAMs and all external inefficient I/O such as SSTL-2 are the biggest power hogs. In other words, very large, shared, contended and blocking DRAM accessible through a high-latency infrastructure such as PCIe and CXL enforces an ever-growing need for more caches.
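
    The cost of a single stall can be sketched the same way. Again, the clock speed, IPC and latencies below are assumed, illustrative values, not measurements of any particular part.

        # Simple sketch (assumed numbers): how many instructions a core could
        # have retired while it waits for one memory access - which is why deep
        # cache hierarchies and context switching exist in the first place.

        clock_ghz = 3.0     # assumed core clock
        ipc       = 2.0     # assumed sustained instructions per cycle

        for name, latency_ns in (("local DRAM", 90), ("CXL/PCIe-attached DRAM", 850)):
            lost = latency_ns * clock_ghz * ipc
            print(f"{name:>22}: ~{lost:,.0f} instructions' worth of time per miss")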

    Accelerators in HPC for Beginners

    Posted on by: Axel Kloth

    Whenever I get asked what HPC is, I need to find an analogy. The analogies I use most are as follows: The CEO of a company does not personally do any of the work that leads to that company's ultimate product. The CEO hires and directs people who create the product. He or she supervises the hiring, the training, the work, the workloads and the quality of all involved parties, to make sure that the product is built to the specification that the customer wants. Sometimes even that is too abstract. In those cases, I try a different approach. A conductor does not play the music himself. He or she hires all necessary pieces of an orchestra and directs them to play the musical piece. He or she simply hires and supervises the execution of the performance. In the same fashion, the programmable elements in a supercomputer direct the workloads to be executed by specialized accelerators. The accelerators are less flexible, sometimes not even programmable, but they are much faster at executing a task, use less energy, take up less space on a chip, and on top of that they are much more robust and usually not vulnerable to attacks from hackers.

    HPC at Crossroads

    Posted on by: Axel Kloth

    It seems as if there is a confusion around the future of HPC, GPGPUs, special-purpose accelerators for AI (mostly the ML training part) and Quantum Computing. I have written up a short summary on where the industry is going, and Startup City has published it so that readers can familiarize themselves with the concepts, the outlook and the technologies needed. The article High-Performance Computing at a Crossroads hopefully clarifies some of the misconceptions. Abacus Semiconductor Corporation is working on processors, accelerators and smart multi-homed memories that can carry over digital bulk-CMOS technologies with improved system design over current processors until general-purpose Quantum Computers are available and affordable to solve the computational challenges of the future. As a Venture Partner at Pegasus Tech Ventures it is my responsibility to look at startups in this field and evaluate if they can advance the state of the art.

    Malware so far in 2022

    Posted on by: Axel Kloth

    It seems as if there is no letting up on malware. While in prior years we saw phishing attempts and redirections to phishing sites coming from Russia, North Korea, the Chinese Academy of Military Sciences, plenty from India and a few each from South America and from Iran, this year seems to be dominated by Russia. Particularly active was root@validcapboxes8.pserver.ru using multiple aliases off the same server. These were mostly fake loan payoff notices, fake quarterly financial results and fake annual statements. All of these files were MS Office files with embedded macros, renamed to appear as PDFs. I stopped counting the expiration notices of my email account and its password, as all of them also came from Russia. Except for one yesterday, coming from Iran. Nothing from China or North Korea, India or South America this year so far. It is pretty annoying, and that they can’t be caught is somewhat disturbing.

    FMS in 2022 as an in-person Event

    Posted on by: Axel Kloth

    The Flash Memory Summit is back to being an in-person event for 2022. While I am not presenting or organizing a panel this year, I am still on the Conference Advisory Board. Check out FMS for 2022, its agenda and its CAB!

    Broadcom acquiring VMWare

    Posted on by: Axel Kloth

    Broadcom has announced that it is buying VMWare. Broadcom is a fabless semiconductor company that had a historic focus on communication ICs and particularly switch fabrics (the Tomahawk series in particular). It was acquired by Avago, which in turn was a spinoff of HP/Agilent Semi. Avago renamed itself Broadcom after the acquisition had completed. While switch fabrics are still part of its core business, Broadcom has tried to diversify itself in the past 5 years. It first acquired CA (formerly Computer Associates) and then Symantec’s enterprise division (the remaining consumer business is now NortonLifeLock). It is unclear to me where the synergies in these acquisitions are, and if there is any cross-pollination of technology between those units. The same holds true for VMWare. Broadcom does not make the server CPUs that power the hyperscalers, nor does it make smart NICs or DPUs (Data Processing Units) as they are called today. VMWare would benefit from server CPUs with virtualization hardware support, which Broadcom does not make, and it would benefit from smart NICs with support for IOMMU tasks and any hardware-assisted protocols between server CPU and NIC as well as NIC-to-NIC protocol offload. Those would be synergies that create value for customers – but Broadcom does neither. As such, I can only see a sales channel that Broadcom offers. The question then is whether VMWare needs a different sales channel.

    That acquisition of course leaves an upside for anyone who starts a virtualization company today.

    Scientists are cracking HIV

    Posted on by: Axel Kloth

    I keep being asked the question what supercomputers are good for. Further down in the blog, I had written up a list of applications that are typically deployed on supercomputers. The newest one that I found was that Supercomputing helps reveal weaknesses in HIV-1 virus, and usually finding weaknesses in any adversary leads to exploitation of that weakness, and ultimately the elimination of that adversary.

    Now I am waiting for the common cold, the flu, allergies such as hay fever and a whole bunch of other ailments to be looked at... I need to look up the numbers, but people calling in sick for these ailments costs the world economy a very large amount of money. If we can get rid of these things cheaply and quickly, that would save the world a lot of money that could be used more effectively somewhere else.

    More vulnerabilities

    Posted on by: Axel Kloth

    It seems to me that the implementation of crypto engines is not going well. Hackers can steal crypto keys on Intel, AMD CPUs via ‘Hertzbleed’ vulnerability. Certain functions should not be executed in a CPU core, and instead they should be done in dedicated hardware. That makes design and verification easier, and it does not require frequency scaling with all of its vulnerabilities. Any weakness can be exploited. The more complex a system is - and software-based systems are more complex than hardware-based systems - the more attack surfaces will be present.

    That is one of the many reasons that we do not execute any cryptographic functions in software, and we do not keep keys in easy reach of software.

    GSA Silicon Leadership Summit 2022

    Posted on by: Axel Kloth

    I have attended my first large-scale in-person event since the COVID-19 pandemic broke out. The GSA Silicon Leadership Summit on May 12th at the Santa Clara Convention Center was not only well-executed, but also well-attended. Its title - New Horizons - was befitting the event and global developments. My key takeaways were all positive. The semiconductor industry is going to continue to grow, and it will hit the $1T revenue mark some time in 2030 or a few years thereafter. That is a fantastic achievement given that only a few decades ago this industry did not exist, and the then-CEO of IBM anticipated that there may be a need for five computers worldwide. What a difference a predictable cost reduction makes on a market. Today, smart phones have more computational performance than those computers that the IBM CEO referred to. As usual the limitations are I/O, and it seems this time around we will see large-scale deployment of optical I/O directly out of processor packages a few years down the road. Novel memories are needed to support traditional processors using the von-Neumann architecture, and for non-von-Neumann machines, we seem to have a few new paradigms up our sleeves. We still must secure the Internet, and AI will be able to augment some Human Intelligence in an ethical fashion. The headwinds are the current trends towards de-globalization, trade restrictions and, at this point in time, supply chain issues. These headwinds can all be overcome, and I am very positive on the semiconductor industry in general.

    More on Intel Optane

    Posted on by: Axel Kloth

    In February I had found an article from Tom's Hardware pointing out the losses Intel had endured with Optane. My assumption was that Intel would continue selling the business version of Optane until they run out of stock since the JV between Micron and Intel for the production of Optane had been shut down, and the fabrication facility was closed and then sold by Micron. Intel has no other source for Optane memory, and as a result, when the stock is depleted, it's over: Intel has Optane chip hoard with no plans to develop tech.

    Intel's consumer-facing NAND and SSD business was sold to SK Hynix, which has since rebranded it as Solidigm. Apparently, it is doing quite well under its new ownership. Tom's Hardware reports that Solidigm Unveils D7 Series Data Center SSDs: Up to 15.36TB, 7100MB/s - that is an impressive number.

    More Firmware Attacks

    Posted on by: Axel Kloth

    There are more and more BIOS/UEFI attack vectors in the wild, and the problem is that those are not just theoretical in nature; they are being actively exploited. While the newest links seem to indicate that Dell is more affected than others, that is not the case. I am not bashing Dell here, as most of the BIOS and UEFI code is common amongst all PC manufacturers.

    I think that we have reached a point at which it is impossible to continue on as if nothing has happened. The traditional system architecture is flawed in so many ways that we need to rethink it. It is preventing the industry from achieving better (i.e. more linear) performance scale-out, from better integration of accelerators, and from vastly improved security. The old adage of "never change a running system" must finally be overcome.

    Will the Cloud eat HPC?

    Posted on by: Axel Kloth

    The Cloud has come a long way from its first days and inception as EC2. Most companies these days use Cloud computing and technologies in one way or another. Abstracting from processor Instruction Set Architectures (ISAs) and using containers to allow dynamic shifting of workloads have all been invented for and in the Cloud. Many things we take for granted today were impossible to do just a few years ago, and that is an incredible accomplishment. Today, the Cloud is still someone else's data center, as was the case 15 years ago, but new technologies have been introduced to make Cloud services more palatable. What has not changed much is the level of abstraction users get from the Cloud - in fact, if anything, the abstraction level went up, and there are more layers of abstraction (and therefore translation and compute) than ever before. While there are bare metal Cloud offerings, they do not provide the Cloud benefits, and the performance level is usually not what could be expected from an on-premise installation. There are attempts to solve this and bring HPC to the Cloud (or, seen from the other perspective, extend HPC to the Cloud). The Next Platform posits a very valid question: Will HPC Be Eaten By Hyperscalers And Clouds? We believe that this is not the case.

    I have explained in my blog post "What is HPC anyways?" what the differences between the Cloud and HPC are.

    While the Cloud and HPC systems are similar, they are not identical, both in terms of the hardware used and in the applications running on them. In short, in typical Cloud applications thousands of applications run on thousands of servers. In HPC, one application with an enormous data set runs on thousands or tens of thousands of servers.

    Abacus Semi admitted to NewChip cohort

    Posted on by: Axel Kloth

    We are proud to announce that we have been admitted to the March cohort of the prestigious NewChip Seed & Series A Accelerator program. This will help us raise awareness of the company and its products among a broader set of investors. Abacus Semiconductor Corporation is re-imagining HPC to remove the existing interconnect bottlenecks, resulting in a greater than one order of magnitude increase in application performance compared to what is possible with existing processors, accelerators and memory.

    Inclusion in the accelerator program will also support our goal of creating an ecosystem around us that will enable other companies to license our technology so that their accelerators - GPGPUs or special-purpose ASICs - can benefit from this improvement in interconnects.

    EU CHIPS Act update

    Posted on by: Axel Kloth

    PC Gamer reports that Europe sets sights on global semiconductor domination. I had mentioned that the 11 Billion Euros that I read about in the article are by far not enough, and I posted that here. It turns out that I either did not read far enough through the entire legislation or the additional grants were hidden somewhere else, but very clearly the EU knows that 11 Billion does not pay for much. The real number appears to be closer to 70 Billion Euros, as PC Gamer states: "To this end the act also includes the potential to invest €30 billion in building fabrication centres by 2030. This puts the total spend at around $70 billion USD over the next ten years." That is a different ballgame and should get Europe back into semiconductor manufacturing. I am glad that this is the case.

    I believe that Europe will benefit from this agreement and money that is invested into the design and manufacturing of semiconductors.

    Firmware Attacks

    Posted on by: Axel Kloth

    Current defenses against malware attacks usually rely on software running on an x86-64 machine. This software can be hosted in a firewall, in a dedicated server to detect endpoint-compromising attacks, inside a mail server to detect spam and phishing attacks, and in a wide variety of other devices, including end user client devices. In nearly all cases, the integrity of the BIOS/UEFI and oftentimes the Operating System is inherently assumed. That assumption is dangerous since it's wrong. The Operating System can be compromised easily - no Operating System I know of is impenetrable. What is worse is the assumption that the BIOS or UEFI cannot be compromised. In every server, there is a BMC, and the BMC can update the host firmware. In other words, any attacker can circumvent Operating System provisions to protect the firmware by taking a short detour through the BMC. In most cases, this goes undetected, and the BIOS/UEFI modifications are persistent and even survive an Operating System reinstall.

    I keep hearing that those attacks are hypothetical only, that there are not many of them out there in the wild, and that even if there are, they have no relevance. Wrong. Here is a short sample of what has been published lately.

    Hardware must be used to avoid this, and the BMC route to updating the host's firmware must be made vastly more secure, ideally by using better credentials than just a username and password. The technology to achieve this exists, and we have it. We call it Assured Firmware Integrity or AFI™, Resilient Secure Boot or RSB™ and Protected Shadow ROM or PSR™.

    Intel Optane

    Posted on by: Axel Kloth

    I just stumbled across this article about Intel's Optane at Tom's Hardware. They found out that Intel's Optane Business Haemorrhaged Over Half a Billion Dollars in 2020.

    That is a lot of money, particularly taking into account that this started out as a Joint Venture between Intel and Micron in 2015, with high hopes to close the power and density gap between DRAM and NAND Flash. In 2021, Micron called it quits and bailed out of the JV. As far as I am aware, Micron let go all of the developers who did not want to transition to the DRAM or NAND Flash groups within Micron. Micron also sold the 3D XP fab to TI, and I was under the impression that there was no supply agreement between TI and Intel. As a result, I thought that Intel had closed down all of its Optane business (Optane being the trademark Intel held for the 3D XP technology).

    It certainly did not help that 3DXP was Phase Change Memory, but Intel chose to deny that it was. More importantly, 3D XP never fulfilled the promise of DRAM-like performance at NAND-Flash density and power.

    Apparently, Intel continued to sell Optane-branded SSDs, but clearly at a loss, and with no upgrade path for the technology and the devices themselves. When the stock is depleted, I assume that Intel will simply shut down this brand and technology.

    nVidia/ARM deal off

    Posted on by: Axel Kloth

    The official confirmation that the nVidia/ARM deal is off came in. Ars Technica reports that the $66 billion deal for Nvidia to purchase Arm collapses.

    I am not surprised, and in fact, I think that this is a good thing. There will be some assurance that ARM will continue to be the Switzerland of processor and semiconductor IP, and the ISA will continue to be somewhat of a lingua franca. However, in the long term a good number of ARM licensees will look for alternatives to ARM as an outcome of this ordeal. I had voiced my concerns early on.

    I believe that RISC-V will substantially benefit from this.

    EU agrees on a CHIPS Act

    Posted on by: Axel Kloth

    The EU is investing 11 Billion Euros into the semiconductor industry. Intel is investing $20B in the next few years, and TSMC is pledging $100B in the next few years. The US has a $55B CHIPS act, and it remains to be seen how much Korea, Japan and China are going to put up. The EU pledge is simply not enough to make a difference. It can be found here: Digital sovereignty: Commission proposes Chips Act to confront semiconductor shortages and strengthen Europe's technological leadership. It also comes right on the heels of the announcement that Margrethe Vestager, the EU’s Commissioner for Competition, declared that "achieving semiconductor independency is ‘not doable’".

    Intel's Strategy on its competition

    Posted on by: Axel Kloth

    I think that by now I understand Intel's strategy with regards to ARM, nVidia and RISC-V. They all tie in together and must be seen as a whole.

    Intel has understood for quite a while that they have a formidable competitor in ARM and in nVidia. The ensuing steps were brilliant from a business strategy perspective. It is in fact the old "divide and conquer" method.

    Intel invested in SiFive as the first of the RISC-V commercialization plays to make sure that ARM's growth can be stunted. Intel had understood quite a while ago that it had lost its ability to compete with ARM in the cell phone, smart phone, tablet and low-end laptop markets. Intel had to have a way to stop ARM from dominating the industrial IOT (IIOT) market, and RISC-V is a perfect means to achieve that. RISC-V, as architected and first implemented as Rocket Chip, had all of the necessary ingredients to prevent ARM from completely dominating the IIOT market, where Intel's x86-64 had no chance of competing. Investing in SiFive offered a way to steer SiFive into the IIOT market and help build out the ecosystem around RISC-V, particularly focusing on tool and IP development for the embedded and IIOT markets. This ensured that ARM would have a viable competitor in the IIOT market without affecting Intel's cash cow, the data center market.

    That means that Intel could focus investments in fabs and in the data center market. ARM would be taken care of by RISC-V in ARM's native domain. In the data center market, Intel would only have to compete with AMD and with nVidia. Investment in RISC-V would also give Intel a strategy in case x86-64 really started being unable to compete technically. It was a cheap insurance against any surprises.

    If nVidia and more specifically CUDA were to become too much of an economic problem for Intel, a simple way to cut that off would be to invest in special-purpose accelerators, such as Cerebras, GraphCore or any others (including us) for evolving new requirements, and to develop CUDA compatibility for all of those accelerators.

    The termination of the nVidia and ARM merger is a boon to Intel as it forces current ARM licensees to rethink their strategy, and nVidia will likely return to the RISC-V table.

    EU is giving up on becoming independent

    Posted on by: Axel Kloth

    The EU had grand plans to become independent of the US, the UK, China and anyone else in the design and manufacture of semiconductors. That goal alone highlights a complete misunderstanding of how semiconductor design and manufacturing works. First of all, tools are needed to design the products. Those tools are non-trivial to create and to use. Students have to be educated in their use, and proficiency must be achieved. Then, after the design phase, logical correctness must be established, with yet another set of tools. These tools again are non-trivial to write and to use. After the verification of the logical correctness of the design, the physical design phase starts, and that is non-trivial as well. These tools are specific to the manufacturing process, so they cannot be designed in a vacuum. This is a collaboration between the manufacturing plant or "fab" and the physical design tool designer, creating a PDK (or Process Design Kit). Once the physical design is done and all components are placed, it has to be verified that the logic design is reflected in the physical design's implementation, and whether the targeted clock frequency can be achieved. This is the dynamic timing closure phase, which can unravel a lot of the physical design because unlike for mathematical (Boolean) logic, light and electrical impulses do not travel at infinite speeds. If a signal path is too long, parts of the design have to be relocated. This is an iterative process that can take weeks and hundreds to thousands of hours of CPU time on a large-scale computer cluster. It was clear to anyone inside the industry that the EU would not be able to achieve full independence.

    I had anticipated that the EU would declare that some manufacturing will have to be lured back to the EU member states, with incentives being paid and well-trained workers being made available through continued education paid for by the EU. I had also anticipated that the EU would declare a preferred CPU Instruction Set Architecture (ISA) that must be used for all military and crucial infrastructure projects. RISC-V would have been a perfect choice.

    I have been wrong, as I have been so many times when it comes to predicting actions that politicians take.

    Margrethe Vestager, the EU’s competition chief, declared complete and unconditional surrender. Achieving semiconductor independency is ‘not doable,’ EU competition chief says.

    No preferred ISA, no preferred High Level Language (HLL) for semiconductor design, no embrace of tools such as CHISEL, no luring back of semiconductor fabs to Europe, no on-the-job retraining of engineers to target semiconductor design. Not even trying to retain processor design talent so that at least the design of processors and accelerators remains a possibility in Europe. Nothing. I'd call that a complete and unconditional surrender.

    Entirely giving up semiconductor independence of course also means giving up on leading High Performance Compute (HPC). In other words, the EU will continue to use US or Chinese processors and accelerators and memory to power their next-generation Supercomputers. Who says that there are no back doors in there?

    Apple proves that the ISA is not relevant

    Posted on by: Axel Kloth

    I keep hearing that the Instruction Set Architecture (ISA) is important, and that without binary compatibility of our processors to the Industry Standard we have no market.

    That is a gross misunderstanding of how things work these days.

    First of all, Apple has changed processors and ISAs multiple times now. Which processors and ISAs did Apple use over time? Apple started out with the 68000 from Motorola. When that ran out of steam, Apple changed to POWER/PowerPC (IBM). IBM discontinued that product line, so Apple switched to x86-64 from Intel. When Apple saw that Intel promised better performance but at ever-increasing levels of power consumption, Apple had to find a new way to improve performance on the same trajectory as Intel promised with x86-64, but with a decreased level of power consumption. That required a different ISA and a different manufacturing process, and so Apple switched to ARM processors that were designed in-house and fabricated at TSMC. These developments became the A and M series processors for the iPhone and the Macs.

    None of these processors share an ISA.

    Every single time Apple changed ISAs, there was an outcry from people who did not know any better that such a switch would be devastating, and that it could not possibly work. Every single time it went without large hiccups, largely due to the fact that the Operating System (OS) is not written in the processors' assembly language, but in a higher-level language, typically in C. That means that code rewrite is limited to very small portions of the OS. With ever-better compilers such as LLVM/CLANG, recompiling the rest of the OS becomes a fairly manageable task.

    Four ISAs over time. No substantial problems.

    In other words, the ISA has become less relevant.

    Another change that took place is the web, or more specifically, the xAMP stack. The xAMP stack is a software stack comprising an Operating System (usually Linux for the LAMP stack, FreeBSD for the FAMP stack, and Windows for the WAMP stack), Apache as the web frontend, mySQL as the database, and PHP/Perl/Python as the scripting language. The Internet is built on and predicated on the xAMP stack. Nearly everything in the "backend" of the Internet runs on top of an xAMP stack. How much?

    According to Pronskiy, PHP runs "78 per cent of the Web," though the figure is misleading bearing in mind that this is partly thanks to the huge popularity of WordPress, as well as Drupal and other PHP-based content management systems. PHP is some way down the list of most popular programming languages, 11th on the most recent StackOverflow list, and sixth on the latest GitHub survey, down two places from 2019.

    If PHP runs 78% of the web, then by the very definition of the stack it must run on 78% of all web servers. So 78% of all web servers run the xAMP stack.

    In other words, if we have LAMP, FAMP or WAMP running on any given server with any CPU that supports this stack, we cover 78% of all servers and/or traffic.

    Linux and FreeBSD run on RISC-V as of today. Apache has been ported, mySQL has been available for quite a while, and PHP, even with its JIT, reportedly runs as well. In other words, LAMP and FAMP run on RISC-V as of today.

    While work is underway to port better databases than mySQL (such as ScyllaDB, Cassandra, PostgreSQL or KeyDB), they are not necessary for RISC-V to run web applications. RISC-V is not a niche product for which drivers are hard to come by. Drivers can be an issue for an RTOS in an embedded device, but today RISC-V is already running Linux and FreeBSD with all necessary drivers.

    What is HPC anyways?

    Posted on by: Axel Kloth

    I keep getting questions about HPC and Supercomputers that make me think that the industry has not done a great job in explaining what HPC is, why Supercomputers are needed, and why a Supercomputer is not just a hyperscaler's data center.

    Let me first answer what HPC is. HPC describes a segment of compute that deals with very large-scale problems. Weather forecasting with good accuracy over more than 5 days is still an HPC problem. The number of input parameters and the size of the volume elements largely determine the computational effort and the accuracy of the result, particularly if the result has to be precise enough to take action 5 days out or more. Climate modeling is another application for HPC. Any large-scale Finite Element Method (FEM) being used to statically or dynamically simulate the behavior of a system falls in the category of HPC. Crash tests for cars under development that simulate the behavior of a car in an accident are HPC. Studying how to contain the plasma in a nuclear fusion reactor certainly qualifies as HPC. Understanding how the coronavirus interacts with cells is an HPC application.

    HPC applications are usually executed on one or more Supercomputers. Why? If a large-scale problem must be solved, we can either use a very large computer cluster with many CPU and accelerator cores to solve the problem in reasonable time, or, if we have too much time on our hands, we can use a small cluster of servers and wait for weeks or months or even years for the result.

    So then why don't we use "The Cloud" as a supercomputer to solve HPC problems? First of all, "The Cloud" really is someone else's data center - in most cases, it will be Google's, Amazon's or Microsoft's servers in a data center with tens, if not hundreds, of thousands of servers. In "The Cloud", tens of thousands of applications for tens of thousands of users run on tens of thousands of servers. The individual workloads are small, and so is the need for disk or network I/O per user or per application. There is very little need for those applications to communicate with other users' applications. As a result, there is not much need for low-latency communication between servers. That led to the development of containers and Kubernetes, which add a higher level of abstraction and allow workloads to be moved from one server to another in case of failure, or if a server starts being overloaded and response times increase. In other words, in "The Cloud" the server, its CPU and its memory as well as local disks are the performance-dominating parts. The interconnect plays a very small role.

    In a Supercomputer, the processor, accelerators, memory and local disk are important, but since we know that the computational problem is very large and requires thousands of servers to work in concert, the interconnect plays a crucial role. Imagine an employee sitting at a desk, sharing the work with a colleague across the desk. If there is any question, the employee can immediately ask the colleague and clarify whatever needs to be explained. That is a low-latency, high-bandwidth interconnect. Now imagine the colleague is on a different floor. The employee has to get up, go to the other floor, find the colleague and ask the question. We may still have the same bandwidth of communication between the two employees, but the latency has increased. That has an immediate impact on the granularity of the tasks that can be shared. Getting up, going to another floor and finding the colleague takes enough time to reassess whether it is worth the effort, or whether one should try to solve the problem by oneself. Only if the problem is large enough to justify the time lost finding the colleague would the employee try to farm out the problem. It is exactly the same in a Supercomputer. The higher the latency between two processors, or between processor and accelerator, or to their memory, the larger a task has to be for farming it out to make sense. The aggravating factor is that high latency even impacts metadata, such as the simple question of whether the other processor is busy or whether it can take on the task in the first place.

    In other words, in a Supercomputer the interconnect plays a crucial role because the workloads differ from those in a hyperscaler's data center. That is why Supercomputers or HPC as a Service will have to wait until there is a unified processor, accelerator and memory architecture that can serve both purposes at a cost comparable to today's industry-standard machines.

    No more VPNs needed?

    Posted on by: Axel Kloth

    Tom's Guide reported that Security experts say you no longer need a VPN — here's why.

    When I read that, I have to say I was perplexed. I was surprised not only by the author, but also by the security experts. The argument that the security experts made was that all traffic is encrypted anyway, and therefore the need to use a VPN (which encrypts traffic and provides you with a trusted DNS from your VPN provider, your home or your own DNS Server) is not there any more.

    Needless to say, I fundamentally disagree. Your traffic is only protected if you exclusively access sites that use SSL/TLS, signified by the lock in the address bar and a URL that starts with https://. However, if you handle email outside of a web mail client, use FTP, or use any protocol other than secure HTTP, that traffic is in cleartext. WLAN snooping allows anyone to listen in or, worse, insert himself or herself as a man in the middle in a MITM attack. That is simply not acceptable, because a lot of the work we do remotely is non-https traffic. Leaving all of that in cleartext is dangerous.

    It seems like they did not quite trust their own advice either, since there was a qualifier at the end of the article under "How to protect yourself without a VPN", and then this gem was included: "Set up a private VPN server on your high-end or gaming router, or "flash" a cheap router with free firmware like DD-WRT or Tomato, so laptops and mobile devices can use your secure home broadband connection while out of the house". So you protect yourself with a VPN while being told that VPNs are generally useless these days. This is not only circular logic, it is plainly illogical and bad advice.

    My advice: ignore Tom's Guide's advice and use a VPN server in your house or business, and then install the VPN client on your phone and laptop, and you will be safe anywhere. VPN servers are cheap and easy to configure these days. Check out my FOSS recommendations here.

    nVidia's takeover of ARM in trouble

    Posted on by: Axel Kloth

    When nVidia proposed to acquire ARM, I had my doubts on multiple levels. ARM China is a hot mess, and there is no resolution in sight. As an nVidia subsidiary, the licensing terms and conditions for anything from ARM (not only processor cores) would have changed, and that would have put startups in a bad position if they banked their existence on ARM IP. It likely would have also impacted the large licensees such as Apple, Qualcomm, Samsung and many others. Now, it seems the deal is in trouble, and according to Bloomberg, Nvidia Quietly Prepares to Abandon $40 Billion Arm Bid.

    I see that as a net positive. If that acquisition fails, nVidia will return to the RISC-V table, and it will help grow that ecosystem. That means that we are left with x86-64 for servers and desktops, and ARM for smart phones, feature phones, tablets, some server processors and possibly the laptop market. It also helps RISC-V, as ARM won't be as dominant as it is now. I can foresee RISC-V in the edge compute market, in laptops (provided that someone builds yet another beautiful and easy-to-use GUI on top of FreeBSD), and in scaled-out HPC, which is what we are doing.

    In other words, we won't see ARM take over as the next Intel when it comes to ISA (Instruction Set Architecture) monopolies. That's a good thing, despite the fact that ISAs today do not carry the same importance that they did 20 years ago.

    Here are the links to my blog entries highlighting the problems I saw: FTC opens probe into nVidia and ARM merger, nVidia and ARM merger hits roadbumps, Apple and ARM and nVidia buying ARM.

    A flat tire is not a software problem

    Posted on by: Axel Kloth

    I hear more and more often that hardware does not matter. Software will solve all of the world's problems, and the AI generation modified that to "AI will solve the world's problems". Well, no. Plain and simple: this is wrong. The Instruction Set Architecture might not matter as much. But hardware matters.

    Software and even AI (which of course uses some software and lots of APIs) runs on hardware of some type - be it a general-purpose CPU, a special-purpose processor or coprocessor, or an accelerator. With general-purpose CPUs, GPGPUs and most accelerators built on premises from 20 years ago, we need to re-evaluate that hardware. That includes the CPU, all of main memory and mass storage, accelerators, interconnects and how we deal with I/O, DMA and Interrupt Requests.

    A flat tire is not a software problem. You can spend dozens of hours hacking the car's tire pressure monitoring system (TPMS) to allow it to continue to drive, but in the end you will ruin the tire, the rim and eventually bottom out on the brake disk, then the wheel hub, and then the frame or unibody frame members of the vehicle and ultimately destroy them in that order.

    A flat tire is a hardware problem that needs to be fixed.

    I just watched a YouTube video explaining the radix sort. We all know that Google and YouTube know everything that can be known in this universe, but I had a healthy laugh. Why? Because the premise is to not compare and instead create new lists in DRAM, read from those lists, and even use pointers in DRAM to list elements in DRAM. While radix sort does require vastly fewer operations than quicksort, bubble sort and other comparison-based algorithms, guess what the slowest thing is you can do in today's computers? If you guessed DRAM reads and writes, you'd be correct. One of the fastest operations a CPU executes? If you guessed a compare, you'd again be right. So... very clearly the software developers have not talked to the processor designers in 10 years. Or maybe 20. It is time to fix that. CMP or BNE are fast. DRAM reads and writes are not.
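
    To make that concrete, here is a minimal sketch of one pass of an LSD radix sort (my own illustrative code, not from the video): every element gets read from one DRAM-resident array and written to a scattered position in a second one, whereas the per-element work of a comparison sort is a CMP and a branch that stay in registers.

        #include <stdint.h>
        #include <stdio.h>

        /* One counting-sort pass of an LSD radix sort, bucketing by one byte. */
        static void radix_pass(const uint32_t *src, uint32_t *dst, size_t n, int shift)
        {
            size_t count[256] = {0};

            /* Histogram: a full sweep over the array in DRAM. */
            for (size_t i = 0; i < n; i++)
                count[(src[i] >> shift) & 0xFF]++;

            /* Prefix sums give each bucket its start offset in dst. */
            size_t offset[256], sum = 0;
            for (int b = 0; b < 256; b++) {
                offset[b] = sum;
                sum += count[b];
            }

            /* Scatter: another full sweep, with writes to scattered DRAM
             * locations; this is the traffic the operation count does not show. */
            for (size_t i = 0; i < n; i++)
                dst[offset[(src[i] >> shift) & 0xFF]++] = src[i];
        }

        int main(void)
        {
            uint32_t a[] = { 0x30, 0x10, 0x20, 0x10 }, b[4];
            radix_pass(a, b, 4, 0);
            for (int i = 0; i < 4; i++)
                printf("%#x ", b[i]);
            printf("\n");
            return 0;
        }

    Sorting full 32-bit keys this way takes four such passes, i.e. eight sweeps over memory, which is why the operation count alone does not predict wall-clock time once the data no longer fits in cache.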

    Democratizing Chip Design

    Posted on by: Axel Kloth

    Mike Wishart and Lucio Lanza explain in EETimes why chip design is experiencing a renaissance. They claim that The Democratization of Chip Design leads to many new entrants into the IC and processor design spaces. To a degree that is correct, as the barrier to entry is lowered by new languages that are vastly easier to understand. For example, CHISEL and Scala are used to generate a RISC-V processor with all of its peripherals, and no Verilog or VHDL is required to write a RocketChip RISC-V processor. However, Verilog is created out of the generator languages, and that output needs to be understood, modified and integrated into the rest of the design. While I agree with Mike and Lucio that this in fact makes a whole lot of things a whole lot easier, I am not sure we will see many more entrants into the realm of processor design. We have been using CHISEL, Scala and RISC-V since 2012. During that time, we did not experience many newcomers. What I can envision is that many more companies will sprout up that develop microcontrollers for many more special-purpose applications. Why? With the design of the processor core out of the way, all kinds of accelerators, peripherals and I/O ports can be designed with relative ease, and that is what microcontrollers are: a processor core with just enough performance and an industry-standard Instruction Set Architecture, and lots of other IP around it. That IP can be written in Verilog, in VHDL, in Scala/CHISEL or in any other language that is fit for the purpose.

    The Significance of the xAMP Stack

    Posted on by: Axel Kloth

    The Internet has been around for the average user for more than 20 years now.

    A good portion of its success was that it established a homogeneous platform on the server-side backend upon which everyone could build additional applications. This platform is called LAMP, an acronym for Linux, Apache, mySQL and PHP. While alternatives for each component exist (there are alternatives to the underlying Linux OS such as Windows and FreeBSD as well), Linux prevails and is the most-often used OS in the backend. Apache has had a few competing solutions that focus on better performance or scalability, such as nginx. The same is true for mySQL - more modern databases, including in-memory databases, have sprung up and can replace mySQL in the LAMP stack. PHP (and Perl) form the foundation of the software running on top of LAMP. Both are interpreted (scripting) languages. In other words, for as long as applications rely on structured query language commands compatible with mySQL, use the same protocol as Apache, and can execute PHP (or Perl) commands, the vast majority of the applications of the Internet stack will work without modification, and since everything on top of the LAMP stack relies on interpreted languages, they are ISA-independent. As a result, Internet applications will work on x86-64 processors, on ARM or MIPS processors, and on RISC-V processors without recompiling. The processors' Instruction Set Architecture (ISA) has become less relevant with the lingua franca of the Internet.

    Recompiling is only needed for the basic applications that make up the LAMP (or WAMP or FAMP or xAMP) stack. In essence, that means that applications that make up the Internet backend will work on any processor once the xAMP stack has been made available by compiling it for this processor architecture or ISA. Any new processor ISA that is supposed to be used in Internet backend applications therefore only needs to provide an Operating System, Apache or a compatible web server, mySQL or a compatible database, and a PHP interpreter. With those few components, which can easily be created from the open source repositories with an appropriate compiler, an xAMP stack can be provided such that this processor architecture can be deployed in servers on the Internet backend.

    Modern compilers such as LLVM/CLANG can even be used to allow a processor to execute a different processor's ISA. In other words, Apple's M1 Pro in conjunction with LLVM/CLANG can execute x86-64 code, which may be necessary if a native application is not yet available. Depending on the quality of the emulation and its hardware support, a processor might be able to execute a different processor's ISA in near real time.

    Understanding the Human Brain needs HPC

    Posted on by: Axel Kloth

    I keep hearing that HPC is too abstract and no one understands why we need it. I am not quite sure how to answer this. There are so many applications for HPC that go unnoticed that I can understand why they are not on top of the mind of the layperson, but in reality they have an impact on everyone's life. Weather and climate forecasts, crash tests, computational fluid dynamics and most bio-engineering research are HPC applications. The human brain is another one, so if you can't wrap your brain around it, then it is because it is being simulated for researchers to understand it better!

    An excellent overview of what that kind of research does and what it can accomplish is summarized here at Human Brain Project: Researchers outline how brain research makes new demands on supercomputing.

    Pat meets resistance at Intel's reorg

    Posted on by: Axel Kloth

    It looks like Pat Gelsinger is doing all the right things at Intel - and predictably runs into resistance, both internal and external. I bet a lot of analysts don't like the new strategy, and unfortunately, too many investors listen to analysts. In my opinion, analysts are Monday morning quarterbacks. They have no insight and no responsibilities, take no risk, but feel free to criticize after the fact. They are wrong more often than those who have to run a business and make decisions.

    Among others, Pat has reorganized the HPC group (Intel Confirms Damkroger Out as Head of HPC; McVeigh to Lead Newly Formed Super Computer Group) after splitting it up into two: Intel Reorgs HPC Group, Creates Two ‘Super Compute’ Groups.

    Cooling Technologies for Data Centers and HPC

    Posted on by: Axel Kloth

    I have tried to figure out the power consumption of the totality of today's data centers, supercomputers, the Internet backend, the Internet itself with its Points of Presence, the last-mile providers such as ComCast and AT&T and their international equivalents, and the numbers I have found indicate a staggering degree of uncertainty.

    The numbers I found vary between roughly 1% and close to 7% of the worldwide generation of electricity, and that is not even including the power consumed by the miners of bitcoin and the like. There are not quite 200 countries on this planet. If it is 1% then this would mean that the power consumption fueled by our digital habits is higher than the power consumption of a good number of entire countries. If it is approaching 7%, then that would make it so large that entire clusters of countries (such as for example all of Northern Africa) use less power than what's needed to run our digital economy. While I am as guilty as anyone else of contributing to this, it occurs to me that we must do something to cut that power consumption back. A good portion of the power needed for data centers revolves around cooling, and to me, that seems like it is the easiest part to address quickly.

    Today, data centers are cooled via horizontal movement of air through servers and top-of-the-rack switches, and subsequent vertical movement of that hot air to the ceiling, where it is then extracted and cooled. That's the dumbest way of cooling, as air does not transport heat well, and heated air has a lower density than cold air and as such wants to rise. At the very least, data centers should adopt the former TelCo rack standards with no horizontal boards allowed, letting air rise vertically through the rack from the raised floor providing cool air to the ceiling where heat is removed. Ideally though, the industry would convert to liquid cooling. As The Next Platform mentions, Liquid-Cooled Systems Are Inevitable, But Not Necessarily Profitable. The industry seems to be so unwilling to change that even some established players retreat.
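
    A quick back-of-the-envelope comparison shows why air is such a poor transport medium for heat. The numbers below are approximate textbook values that I am assuming for illustration (air at roughly 1.2 kg/m^3 and 1005 J/(kg*K), water at roughly 1000 kg/m^3 and 4186 J/(kg*K)):

        #include <stdio.h>

        int main(void)
        {
            /* Approximate volumetric heat capacities, in J/(m^3*K). */
            double air   = 1.2    * 1005.0;
            double water = 1000.0 * 4186.0;

            printf("air:   %.0f J/(m^3*K)\n", air);
            printf("water: %.0f J/(m^3*K)\n", water);
            printf("ratio: ~%.0fx\n", water / air);  /* roughly 3500x */
            return 0;
        }

    Per unit volume and per degree of temperature difference, water carries on the order of three and a half thousand times more heat than air, which is the physics behind direct-to-chip and immersion cooling.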

    HPCWire reported that Asetek Announces It Is Exiting HPC to Protect Future Profitability, stepping away from liquid cooling systems for HPC and refocusing on consumers and data centers only. According to HPCWire, "Asetek has been a mainstay provider of warm water, direct-to-chip liquid cooling technology in use at HPC sites worldwide, partnering with companies such as Cray, HPE, Fujitsu, Supermicro, and Penguin Computing." While allegedly CoolIT has taken up that slack and established companies such as Clustered Systems compete for the existing markets, new entrants such as Ferveret try to convince the industry that immersion cooling is the way to go. Certainly Ferveret will have to weigh the advantages of immersion cooling against the drawbacks of having to harmonize server mainboards, and against competing technologies.

    Data Security

    Posted on by: Axel Kloth

    Securing everyone's data is going to be a herculean effort for a variety of reasons. First, the industry has not really put a focus on data security, despite claiming otherwise. Second, there is a fairly fundamental misunderstanding of what data security is. Third, the attackers are getting better and, as far as I can tell, learn faster and adopt new strategies more rapidly than those who aim to keep the data secure and private.

    There are multiple reasons for all of the above. First of all, data is growing exponentially, and that poses a problem. The number of servers in data centers simply requires automated OAM&P, and as such, tools that help administer hundreds of thousands of servers with one set of credentials for a super-admin. Second, computers and operating systems have reached a level of complexity that makes it nearly impossible to make them impenetrable. As a result, it is always going to be easier to find holes and exploit them than it is to write watertight software on top of un-breachable hardware. Third, the potential payouts and the number of attackers - including nation states - keep growing.

    We are also witnessing that the focus - if data security is discussed at all - is on either data at rest or data in transit. That is akin to securing the vault in the bank and securing the transport vehicle for the cash and other monetary instruments, but not looking at how to keep the drivers of the armored vehicles and the bank tellers safe. In other words, we try to keep data safe when stored and transported, but not the devices that enable and provide that protection. Breaches are very prevalent and will only become more frequent and more devastating once 5G, with its curbside and physically unprotected servers, is ubiquitous. I'll call this Kloth's Fourth Observation: The devices that are intended to keep data at rest and data in transit secure are unprotected and must be secured.

    So far, the industry has simply ignored the problem of computers and networks being breached. While SANS and MITRE keep a database of vulnerabilities and exploits, we do not see that the manufacturers of operating systems, firmware and even hardware (computers, memory, processors and ancillary chips) have fundamentally re-thought how to protect the computer itself. The thought that encrypting data protects it is simply too short-sighted. Even if uncrackable encryption is used to protect data at rest, it is not secure, because once the computer has access to said encrypted data, it must decrypt it to use it. Any kernel process in the OS running concurrently with the legitimate process that deals with the now-decrypted data can access it with little difficulty. Unless the keys are unknown to the operating system, such a process might even be able to steal the keys, making it even harder to defend against threats.

    The military has known strategies to protect theaters for a long time, and the main mantra is that before you can protect others, you have to be able to protect yourself. Accordingly, we must make sure that computers can protect themselves so that only authenticated processes run, for authenticated users, and at times that are defined in service level agreements.

    We have developed technology that enables the processor to defend itself against attacks so that it then can protect data at rest and data in transit. This technology is built into our Server-on-a-Chip and into our HRAM.

    Legal use of encryption at risk again?

    Posted on by: Axel Kloth

    It seems like every single time an FBI Chief is under duress for not doing his/her work, they deflect by insisting that encryption should be outlawed.

    Following tradition, Chris Wray is now doing the same. FBI Director: Ban Encryption to Counter Domestic Extremism. Under pressure for not doing his job with regard to the vetting of Brett Kavanaugh in preparation for the Senate Judiciary Committee hearing, and for dropping the ball again on the Larry Nassar case, he brings up outlawing encryption. Like before, this is ridiculous, and this man does not know what he is talking about, let alone what he should be doing. There’s No Good Reason FBI Director Chris Wray Still Has a Job. If encryption is outlawed, regular citizens can't protect themselves, but criminals will not be deterred by civil or criminal penalties for using it. Simple logic dictates that. If a criminal expects that, if caught and convicted, he or she will face 30 years in federal prison, then what would a few thousand dollars in penalties and fines or 3 months in local prison do as a deterrent? Exactly nothing.

    I had alluded to this in my old SSRLabs blog on 2020-10-12 under the title "US DoJ on Encryption — again", copied verbatim:

    Yet again the US Department of Justice (DoJ) tries to pitch End-To-End Encryption against Public Safety. The reality is that the opposite is true. There is no Public Safety without End-To-End Encryption. Predictably, the DoJ brings up exploitation of children to justify restricting the use of encryption. Encryption relies on secret keys or key pairs. The algorithms are standardized. For backdoors to work, a repository of keys and key pairs has to be created. This database will be the most-targeted piece of property ever, as it would reveal all keys from everyone to everyone else using encrypted communication. Whether this database is a collection of databases by each provider or a centrally and federally managed database does not make a difference. It will be breached. I do not want to go into any more detail here, and anyone who wants to dive deeper is invited to ping me. I promise to return email requests. I'd like to make it very clear: Backdoors to encryption are not needed and are dangerous. This renewed attempt of pushing legislation through that restricts encryption must be stopped.

    and here on 2015-07-16 under the title "Encryption at Risk?", again copied verbatim:

    I am not quite sure what to think of the recent statements that the director of the Federal Bureau of Investigation (FBI), James Comey, has made. According to The Guardian, James Comey, FBI chief wants 'backdoor access' to encrypted communications to fight Isis. To me it looks like he is looking for a justification to first ban and later on outlaw strong encryption without backdoors. This is confirmed reading the statement right from the horses' mouth here: Going Dark: Are Technology, Privacy, and Public Safety on a Collision Course?. Newsweek confirms this interpretation here: FBI's Comey Calls for Making Impenetrable Devices Unlawful. Well, I am not a fan of backdoors. I think that encryption is good and backdoors are bad. The reason for that is very simple. Strong encryption protects you and your privacy. You do not send a piece of important information on the back of a postcard - you put it into an envelope. You do not hand this envelope to Shady Tree Mail Delivery Brothers to get it to the recipient. You drop it into a mailbox of the USPS, Fedex, UPS, DHL or the like, expecting that they do not open the envelope. With the delivery contract, you have a reasonable expectation of privacy. On the Internet, there is no expectation of privacy. If you want something to be delivered such that no one in the path of the transmission from you to the recipient can read the contents, then you need to be able and have the right to use strong encryption to ensure that despite the open nature of the Internet no one can snoop. It also should be up to you to determine what is worthy of protection and what not. If I send an email to a supplier asking if they would like to do business with me, then I do not need any encryption. However, if they agree and they send me back a quote, they sure do not want their competitors to be able to intercept and evaluate their quote and possibly undercut that quote. They have a reasonable interest in protecting their quote. Now let's assume that we have a new law in place that allows strong encryption but requires you to accept a backdoor into your encryption with the backdoor keys being held at a government location. Why is that a bad idea? Well, for starters, the biggest focus of any hacker will be this repository of keys to the backdoors. Any hacker on the planet - good or bad, capable or incapable, ethical or not - will attack this repository. Brute force attacks and social engineering and many other attack methods or simply sheer luck will be used to get in. It is unrealistic to assume that such database can be protected, and it is naive to pretend that a mechanism providing a backdoor cannot be exploited. If history has proven anything then we must assume that encryption with a backdoor is useless as both the backdoor mechanism itself and the centralized repository for the backdoor keys are vulnerable and will be cracked. We know that the likelihood to break into the repository of keys for the backdoors is 100%, no matter how protected this database is. With the repository of keys to the backdoors in an unknown number of unknown hands encryption becomes useless as any crook and any unethical person has access, and the ethical and good people are being betrayed. That's akin to putting every criminal on the streets and every law-abiding person in prison. Is that what the US government and the FBI want?

    To me, it seems like the US needs another amendment to the Constitution, explicitly declaring the use of encryption legal. I am sick and tired of explaining secure communication to people in power without any understanding of technology and its implications. Again, even secure communication will have to have its metadata in plain text visible to any observer, and as such metadata is enough to catch and convict criminals. Insight into the ciphertext is not needed. After all, we do not outlaw the use of letters and only allow the use of postcards.

    Apple is looking into RISC-V

    Posted on by: Axel Kloth

    Apple is looking for RISC-V designers according to Tom’s Hardware. It certainly took Apple a while, and I had predicted that it would happen here. It just amazes me that it took so long. After all, Apple was one of the first investors in ARM when they decided that the Apple Newton was a good idea, and put an ARM 710 into the PDA. I had one of those, and while the idea was great, it simply needed vastly more computational performance than the ARM 710 could deliver.

    I am not sure if Apple remained a shareholder in ARM, or if SoftBank bought out everyone, but in either case it is time to get off the ARM train. I have had more than enough of what used to be Acorn RISC Machines.

    I think that RISC-V is a much more modern and advanced RISC processor, and its ISA is open source. Its ecosystem has grown at a phenomenal clip, and it has the potential to displace ARM. I would like to see Apple join the RISC-V train and help everyone build out the ecosystem for firmware, software and tools such as non-GPL compilers – LLVM/CLANG comes to mind.

    Hyperconverged Servers 2

    Posted on by: Axel Kloth

    As usual in IT, the pendulum swings in one direction to its extreme, and then back to its other extreme. We have seen disaggregated servers (compute nodes on one side, storage on the other), and then came hyperconverged servers. Then it swung back to disaggregated systems, and now we are back with hyperconverged systems. It seems that the transitions are arbitrary, but in reality, they are not. The Next Platform asks "If Hyperconverged Storage Is So Good, Why Is It Not Pervasive And Profitable?". I think two reasons explain the question and the frustration. First, the author appears to focus on just one company. Second, there are good reasons for both disaggregated systems and hyperconverged systems.

    Let me first explain why a disaggregated solution might be beneficial. If compute nodes have to be physically close to some experiment (like at CERN), then space constraints may mean that disaggregated systems are the only choice. If large amounts of data are created and processed, and that data crunching is computationally very intensive, and then that input and processed data must be stored or archived, then again a disaggregated system might make sense.

    If on the other hand the bandwidth needed from compute to storage and back is so high that networking is not fast enough, then hyperconverged systems are the only way out of that predicament. Hyperconverged systems also have the disadvantage that distributed storage (possibly even across continents) for disaster resilience is harder to implement. A little more background can be found here.

    Another issue is that the author focuses on one company, and not on the segment. Maybe for the segment the revenue and profitability data is better. I have no insight into it, but I'd verify from other independent sources if in fact the entire segment is doing so badly.

    VCs don't seem to fund cybersecurity companies

    Posted on by: Axel Kloth

    We see breach after breach after breach. Current hardware, firmware and operating systems as well as application software are not able to stem the tide. I am not sure what to make of it, as usually when there is demand, and a new product or service is available, customers come. A new product and lots of potential customers are generally what VCs salivate over. For reasons that are beyond my comprehension, that's not the case for cybersecurity hardware, firmware and operating systems as well as application software according to VentureBeat. I am at a loss. What gives?

    The market is there, as there is plenty of demand for cybersecurity. If the existing solutions worked, we would not see a continued problem in cybersecurity. While some of the breaches are due to incompetence and social engineering, the vast majority of breaches are due to exploits of weaknesses in all of the areas mentioned above.

    IBM's Telum introduces a novel Cache architecture

    Posted on by: Axel Kloth

    It looks like we are not quite alone in pointing out that the current architectures for caches are fundamentally broken. IBM has unveiled its newest mainframe processor, and it does away with L3 and L4 caches. IBM's Telum is described here in detail, and the one thing that surprises is its new cache architecture. AnandTech expands on this a bit and explains why IBM may have chosen this path.

    The most important takeaway is the same we have claimed for a while: caches are a band aid to mask the latency differences between a CPU core and memory. Caches do not contribute to any kind of computation. They simply hide the latency of main memory.

    We are glad to see that we have confirmation for our thesis.

    DRAM versus NVM

    Posted on by: Axel Kloth

    We have yet another point of reference that DRAM is too expensive even for Facebook. It seems as if Facebook is using (or evaluating the use of) non-volatile memory for cachelib instead of DRAM.

    Our HRAM delivers more than DRAM performance at densities of Flash, and at a cost comparable to DRAM. I had alluded to this before; it was clear from our simulations and matched the results from Arvind at MIT here.

    HPC is broken

    Posted on by: Axel Kloth

    The Next Platform stated in all capital letters that, according to Dimitri Kusnezov, a Department of Energy (DoE) expert on AI speaking at Hot Chips 2021, a new HPC architecture is needed: DOE AI Expert Says New HPC Architecture Is Needed. The DoE is responsible for all HPC efforts of the US government, so hearing from him that HPC is broken reaffirms our position. We have said that for quite a while, and so far, most technical experts agreed with us, but the sentiment of those holding the purse strings was mostly that "it seems to be working fine". It was not, and it is not. Something fundamental has to change. With Kusnezov and the DoE agreeing that it is broken, we believe that money will be made available to finally fix HPC for good. We are ready when they are.

    Here is an excerpt from his talk at HotChips 2021, as recorded by The Next Platform: But the highly complex simulations that will need to be run in the future and the amount and kind of data that will need to be processed, storage and analyzed to address the key issues in the years ahead — from climate change and cybersecurity to nuclear security and infrastructure — will stress current infrastructures, Kusnezov said during his keynote address at this week’s virtual Hot Chips conference. What’s needed is a new paradigm that can lead to infrastructures and components that can run these simulations, which in turn will inform the decisions that are made. “As we’re moving into this data-rich world, this approach is getting very dated and problematic,” he said. “Once you once you make simulations, it’s a different thing to make a decision and making decisions is very non-trivial. … We created these architectures and those who have been involved with some of these procurements know there will be demands for a factor of 40 speed-up in this code or ten in this code. We’ll have a list of benchmarks, but they’re really based historically on how we have viewed the world and they’re not consonant with the size of data that is emerging today. The architectures are not quite suited to the kinds of things we’re going to face.”.

    In other words, HPC was broken before AI made the problem so big that it cannot be ignored any more. To address the computational challenges from climate change and cybersecurity to nuclear security and infrastructure, among others, we will need a new HPC architecture that can deal with extremely large data sets and a speed-up of computation, I/O and storage.

    We are working with our partners to advance HPC towards a new paradigm. We even bridge the legacy world to the novel system architecture, and we add resilience and security features to HPC without giving anything else up.

    RISC-V HW Support for Virtualization

    Posted on by: Axel Kloth

    RISC-V is one of the most important novel processor Instruction Set Architectures (ISAs) of the last decade. It is well thought through, allows for very small implementations of a processor based on the ISA, and due to the fact that the ISA is open source, anyone can implement his or her own processor. What so far has been missing is hardware support for virtualization. We have implemented our version of HW virtualization support as we did not want to wait for the RISC-V steering committee to come up with its recommendation.

    We have developed a method that is universal, future-proof, provides ample performance for virtualization and a hypervisor, and has a software and hardware interface to an IOMMU. We are happy to share this and license this technology on a FRAND (Fair, Reasonable And Non-Discriminatory) basis with any other RISC-V processor company.

    HPC TAM

    Posted on by: Axel Kloth

    The Next Platform is reporting on the HPC TAM and its forecast for the next few years. It alludes to the fact that all assumptions point to The Rapidly Expanding And Swiftly Rising HPC Market because the number of HPC applications grows, and supercomputers are projected to become more affordable.

    We have been saying for a while that the market is already >$5B annually for just the semiconductor components in supercomputers today, and we project a very high growth rate of that TAM. This article confirms this and in fact assumes a TAM growth rate that is higher than our estimate.

    The Future of APIs at CSPA

    Posted on by: Axel Kloth

    I am extremely honored to have been invited to give a talk about The Future of APIs for Accelerators in Open Source at the California Software Professional Association (CSPA).

    If you are even remotely interested in APIs, accelerators, Open Source or the CSPA, please sign up for this event.

    Big and slow is faster than small and fast

    Posted on by: Axel Kloth

    I had searched for this article and finally found it (again). It may sound counter-intuitive, but big and slow is in fact faster than small and fast.

    Why is that the case? The reason is the enormous discrepancy between the throughput of a processor compared to DRAM, and DRAM bandwidth and latency compared to Flash memory. Processors today crunch through data incredibly quickly. In fact, they are so fast that SRAM caches are needed to hide the DRAM latency. If the processor cannot find the data in its cache, it will try to retrieve it from DRAM. It will take a penalty of many cycles to do so. If the data is not in DRAM, then it had been swapped out to disk in the past, and now the processor has to wait even longer as accessing a disk is painfully slow. Therefore, it makes sense to avoid having to access disks altogether. That is not quite possible yet as disks are dense and cheap, but reducing the frequency at which a processor has to access disks enhances performance. Therefore, exchanging expensive DRAM with cheap and slow (but still faster than disk) and much larger Flash memory makes sense.
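
    Here is a toy average-latency model that makes the trade-off concrete. The latencies and hit rates are purely illustrative assumptions (DRAM around 0.1 µs, Flash around 100 µs, disk around 10 ms per access), not measurements:

        #include <stdio.h>

        int main(void)
        {
            double dram_us  = 0.1;      /* assumed DRAM access latency  */
            double flash_us = 100.0;    /* assumed Flash access latency */
            double disk_us  = 10000.0;  /* assumed disk access latency  */

            /* Cluster A: a small DRAM tier that captures 90% of accesses,
             * the remaining 10% spill to disk. */
            double small_dram = 0.90 * dram_us + 0.10 * disk_us;  /* ~1000 us */

            /* Cluster B: a Flash tier large enough to hold the whole data set. */
            double big_flash  = 1.00 * flash_us;                  /*  ~100 us */

            printf("small DRAM + disk spill: %.1f us per access\n", small_dram);
            printf("large Flash, no spill:   %.1f us per access\n", big_flash);
            return 0;
        }

    Even though Flash is a thousand times slower than DRAM in this model, the configuration that never has to touch a disk comes out ahead.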

    For Big Data, it is not surprising that big memory beats small memory, even if the smaller memory is faster. This is exactly what Arvind Mithal at MIT has proven. Arvind found that size matters more than bandwidth and latency in Big Data applications, and that is why the Flash-based cluster was as fast as, if not faster than, DRAM-based clusters of servers. On top of that, the Flash-based cluster was cheaper. The reason is fairly simple: more memory means that the processors have to go to even slower disks a lot less often than they would with faster but much smaller DRAM.

    This mirrors our research data and convinces us even more that our HRAM is the right direction to go as that combines the benefits of DRAM and Flash.

    LinkedIn Connection Requests

    Posted on by: Axel Kloth

    I am really getting sick of the behavior I see with a number of LinkedIn members.

    Two particular groups rile me the most:

    Group 1 includes students who are too lazy to check the career page on our web site to find out how to apply for an internship; instead, they send a connection request.

    In our experience, those interns rarely turn out to be interested in learning. It is unfortunate, but it is a predictor of their performance, and therefore we do not bring those candidates on board any more.

    Group 2 consists of sales people who request to connect to offer "how to explore synergies between our companies" when in reality they want to sell something.

    The sales calls are more annoying. Let's say I find company A on LinkedIn, and they have a VP Business Development. My company B offers a service or product that is complementary to company A's product or service. In other words, there is plenty of overlap in the TAM (market and customers), and our products and services complement each other. In that case we not only do not cannibalize each other's service or product, we offer a more complete solution so that a customer simply gets a better result. Our common customer benefits, together we have more customers with better margins per customer, and both companies make more money - individually and as a synergistic group. That is synergy, and if I cannot find the VP BizDev's email on their web site, I request a LinkedIn connection. I can usually make the case, and in most cases I end up with a good connection.

    What is not synergistic is a sales call disguised as synergy. If my company's money ends up in your coffers because you sold me something then that is a sales call. There is no synergy.

    Consequently, if you want to sell me something, call it that, and tell me exactly what the benefits are for me. I have nothing against a sales call if you make a good argument. If you did not look up what I do and I receive a generic sales pitch, rest assured I will immediately remove you from my contacts. Same for sales pitches disguised as synergistic deals: I will remove you.

    Here you have it. You cannot say you have not been warned.

    Supreme Court Ruling on Oracle vs Google

    Posted on by: Axel Kloth

    CNN Business says that Supreme Court hands Google a victory in a multibillion-dollar case against Oracle.

    I'd rephrase it. I'd say that the SCOTUS has made the use of APIs reasonable. It is a somewhat difficult topic, so I will try to explain it in simpler terms.

    Let's say I am a developer of software or hardware. Let us assume that, hypothetically, I have found a new way to execute the square root function, both in hardware and in software, better than anyone else. I also want people to use my new square root function. So I publish an application programming interface (API) for it by defining y := sqrt(x), and I define what the argument values x and y are, and I define the types of representation (integer, floating point, UNUM, POSIT, and their respective lengths). In other words, I publish the API y := sqrt(x) for everyone to use so they do not have to invent yet another square root algorithm.

    The inner workings of my hardware or software that are called by the API are not visible to anyone without de-compiling or disassembling them. In general terms, they are binaries or executable CPU-specific machine code, or they are calls into a specific piece of hardware that I may have developed. In my function, I first check if the hardware is present to execute the sqrt function. If so, I hand over the input data x and wait for the hardware unit to be ready, and when it hands me the result back, I put this into y so that the calling software can use y. If I do not detect the hardware needed, I execute sqrt in software, and then hand over y as before.
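
    A minimal sketch of that pattern in C, assuming a hypothetical hardware hook: hw_sqrt_present() and hw_sqrt() are placeholder names I made up, stubbed out here so the example compiles; they stand in for a driver talking to a square-root unit.

        #include <math.h>
        #include <stdbool.h>
        #include <stdio.h>

        /* Hypothetical hardware hooks; stubbed out so the sketch compiles. */
        static bool   hw_sqrt_present(void) { return false; }
        static double hw_sqrt(double x)     { return sqrt(x); }

        /* The published API: the only thing callers ever see. */
        double my_sqrt(double x)
        {
            if (hw_sqrt_present())
                return hw_sqrt(x);   /* hand x to the hardware unit and return y */
            return sqrt(x);          /* software fallback, here via libm */
        }

        int main(void)
        {
            printf("my_sqrt(2.0) = %.15f\n", my_sqrt(2.0));
            return 0;
        }

    The caller only ever sees my_sqrt(); how the result is produced behind that API stays invisible, which is the point of publishing an API in the first place.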

    I can hide the function for sqrt and never have to expose how I do it. If someone comes up with a better implementation of the function, then they will likely at some point displace my function - whether they call it sqrt or not. In no case should I be able to sue anyone just because they call their function sqrt, unless I trademarked the name to avoid diluting my brand. If I do not trademark it - and the function call name sqrt should not be patentable - then anyone can use sqrt, create another, better sqrt, or create something similar that encompasses mine.

    That is very different from stealing how I execute my sqrt operation inside my hardware or software. If someone does that, then I should have the right to sue them for IP theft.

    If someone merely re-implements an existing solution with the same APIs or simply uses the APIs via function calls from an application software, then that is the intended purpose of an API, and it should not be subject to a legal battle.

    That was what the Google versus Oracle suit was all about, and rightfully Google prevailed.

    Robustness and EDC/ECC in Memory

    Posted on by: Axel Kloth

    When we look at Supercomputers, we look at the culmination of thousands of cores, Terabytes to Petabytes of main memory, and Petabytes to Exabytes of mass storage. If we want the results to be mathematically correct, we first of all have to use a number system that allows mathematical precision to the degree needed. Recent research by Professor John L. Gustafson has made it abundantly clear that Floating Point math does not quite work as well as we had assumed. UNUM and POSITs are much better representations for numbers in registers that are inherently limited in length. You can read up on UNUMs here: The End of Error: Unum Computing - CRC Press Book.
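
    Two tiny, self-contained examples of the rounding behavior of standard IEEE-754 doubles illustrate the point:

        #include <stdio.h>

        int main(void)
        {
            /* 0.1, 0.2 and 0.3 are not exactly representable in binary. */
            double a = 0.1 + 0.2;
            printf("0.1 + 0.2 == 0.3 ? %s (difference: %.2e)\n",
                   (a == 0.3) ? "yes" : "no", a - 0.3);

            /* Absorption: adding 1.0 to 1e16 is lost entirely at this magnitude. */
            double b = (1e16 + 1.0) - 1e16;
            printf("(1e16 + 1.0) - 1e16 = %.1f\n", b);

            return 0;
        }

    At supercomputer scale, billions of such rounding and absorption errors accumulate, which is what number formats such as UNUMs and POSITs set out to address.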

    Google has found spurious errors in CPUs executing arbitrary functions (FYI: Today's computer chips are so advanced, they are more 'mercurial' than precise - and here's the proof), indicating either production, clock frequency or inherent design issues at the modern process technology nodes used in production. Memory has grown to sizes where even an error rate of 1 in 10^15 is not good enough any more. As a result, we must make sure that we account for these problems properly. CPUs must be designed with more focus on correctness and robustness. Memory must be built such that autonomous verification of the correctness of stored information is guaranteed. Verification of the stored contents must be done without CPU intervention, and detection of multi-bit errors and correction of single- and double-bit errors must happen inside the memory subsystem itself, not only when data is written to or fetched from main memory. In other words, main memory has to become smart. It has to be able to scrub itself periodically, it needs to detect spurious and persistent errors and flipped bits, and it has to be able to correct spurious single- and dual-bit errors autonomously. It should be able to take persistently defective cells or pages out of operation without CPU intervention, and it must remap spare memory into the affected address space. The host processor should be informed of this fact, but its memory space must not be affected in any way. In essence, the remapping should be done such that it is invisible to the operating system memory management tables and the processor's address space, but visible to the processor's and operating system's OAM&P software.
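
    As a sketch of what such a smart memory subsystem could do, here is a toy scrub loop in C. secded_check_and_correct(), remap_page() and notify_host() are hypothetical placeholders for a SECDED ECC engine, spare-row remapping logic and a host notification path; they are stubbed out so the sketch compiles, and the loop would run inside the memory controller, not on the host CPU.

        #include <stdio.h>

        typedef enum { ECC_OK, ECC_CORRECTED, ECC_UNCORRECTABLE } ecc_result_t;

        /* Hypothetical hooks, stubbed out so the sketch compiles. */
        static ecc_result_t secded_check_and_correct(unsigned long addr)
        {
            (void)addr;
            return ECC_OK;                    /* pretend the line is clean */
        }
        static void remap_page(unsigned long addr)
        {
            printf("remap page at 0x%lx\n", addr);
        }
        static void notify_host(unsigned long addr, ecc_result_t r)
        {
            printf("notify host: 0x%lx, result %d\n", addr, (int)r);
        }

        /* One scrub pass over a physical address range, line by line. */
        static void scrub_pass(unsigned long base, unsigned long size, unsigned long line)
        {
            for (unsigned long addr = base; addr < base + size; addr += line) {
                ecc_result_t r = secded_check_and_correct(addr);
                if (r == ECC_CORRECTED) {
                    notify_host(addr, r);     /* fixed in place; report only  */
                } else if (r == ECC_UNCORRECTABLE) {
                    remap_page(addr);         /* map in a spare, keep the host */
                    notify_host(addr, r);     /* address space unchanged       */
                }
            }
        }

        int main(void)
        {
            scrub_pass(0x0, 4096 * 16, 64);   /* scrub 16 pages in 64-byte lines */
            printf("scrub pass complete\n");
            return 0;
        }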

    DRAM itself has design vulnerabilities that can be exploited. These are predominantly what is called Rowhammer and Half-Double. TechExplore published that Google announces Half-Double, a new technique used in the Rowhammer DRAM security exploit. Google itself published this exploit here. The story was picked up by Gaming on Linux, by ZDNet, by Wired and by TechRadar.

    Unfortunately, it does not end there. Despite the introduction of "Secure Boot", the BIOS remains vulnerable. Since the BIOS content is non-volatile, any attack against the BIOS that achieves a successful overwrite will install a persistent threat. As an example, ThreatPost claims that 30M Dell Devices at Risk for Remote BIOS Attacks.

    It is therefore imperative that we design a better memory subsystem that offers better performance, supports more linear scale-out in performance, provides better robustness against spurious errors, can detect multi-bit errors autonomously, can remap memory areas without CPU intervention, shields against RowHammer and Half-Double attacks, and incorporates defenses against BIOS attacks.

    Supercomputer "Speeds"

    Posted on by: Axel Kloth

    I always hear people talk about "Supercomputer speeds". Usain Bolt is fast for a human. Cheetahs are fast. Peregrine falcons are fast - at about 320 km/h or 200 MPH they are incredibly fast. But that is not the type of "fast" that we are looking at when judging computers. Computers are rated based on the amount of computational problems they can solve in a given period of time. Certain "benchmarks" have evolved over time to make such determinations. Those benchmarks include but are not limited to LINPACK, BLAS (Basic Linear Algebra Subprograms) and DGEMM/SGEMM (double and single precision Matrix Multiplication).

    BLAS and DGEMM as well as other benchmarks define a certain size of the matrix (or matrices) for the multiplication to be carried out. Since BLAS and DGEMM make extensive use of what in computer science is called a fused multiply-add, or FMA for short, this is the instruction that CPU designers optimize most. For any given matrix size, the multiplication requires row elements to be multiplied by column elements. These are usually neatly arranged in memory and are loaded by the Cache Controller in blocks called Cache Lines, to circumvent the high latency of DRAM (Dynamic Random Access Memory, the main memory). Therefore, knowing the maximum size of the matrix, optimizing the caching algorithm for that matrix size, and having an efficient FMA can be used to create benchmark results that oftentimes cannot be replicated in real-life applications.

    A big problem with that approach is that while caching itself is not bad, it leads to the design and deployment of caches that are larger than they would need to be in a better-balanced system. Since Caches consist of very fast transistors (particularly the TCAM or tag RAM in them), they consume a very large portion of the energy that the processor uses. However, Caches only mask latency differences between the internal registers of processors and the main memory they use, which is DRAM. Caches do not compute, and they do not create any computational results. Partially because of BLAS and DGEMM, the industry has increased Cache sizes more and more, and focused less on improving interconnects and bisection bandwidth. That has left us in a situation in which we have very large Caches in the processors and accelerators, and we bank on the Cache Controllers having pre-fetched the proper data instead of making sure that the interconnect bandwidth between processors is high enough to enable any-to-any core access without having to go through remote memory and thus relying on the efficiency of the Cache Controllers and their algorithms and policies for caching and aging.
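
    For reference, here is an intentionally naive sketch of the DGEMM kernel these benchmarks revolve around (C = C + A * B for square n x n matrices). The innermost statement is the fused multiply-add; production BLAS libraries block these loops for the exact cache sizes of the target CPU, which is precisely the tuning that produces benchmark numbers real applications rarely see.

        #include <stdio.h>
        #include <stdlib.h>

        static void dgemm_naive(size_t n, const double *A, const double *B, double *C)
        {
            /* i-k-j loop order keeps the B and C accesses sequential in memory. */
            for (size_t i = 0; i < n; i++)
                for (size_t k = 0; k < n; k++) {
                    double a = A[i * n + k];
                    for (size_t j = 0; j < n; j++)
                        C[i * n + j] += a * B[k * n + j];   /* the fused multiply-add */
                }
        }

        int main(void)
        {
            size_t n = 256;
            double *A = calloc(n * n, sizeof *A);
            double *B = calloc(n * n, sizeof *B);
            double *C = calloc(n * n, sizeof *C);
            if (!A || !B || !C) return 1;

            for (size_t i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }
            dgemm_naive(n, A, B, C);
            printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0 * (double)n);

            free(A); free(B); free(C);
            return 0;
        }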

    For n-body, FEM and FEA and all other computational problems with nearest-neighbor interaction, the only metrics that count are bisection bandwidth and latency. That cannot be resolved with caching. A novel architecture is required. That's the reason why we do things differently.

    Just to recap, here is how outsourcing some of the computational work from a CPU to an accelerator or a coprocessor works (a code sketch follows the list):

    • A piece of software that was written to distribute tasks between CPU and accelerator identifies a task that benefits from execution on an accelerator
    • As soon as all input data and accelerator instructions are available, the CPU (or a DMA Controller or IOMMU) transfers the data to the accelerator or its memory
    • The CPU instructs the accelerator what to do (basic or compound math functions) in the accelerator's instruction set
    • The accelerator crunches through data
    • While the accelerator finishes its task, the CPU core(s) can continue executing other work
    • Most often during those times, the host CPU retrieves and prepares new data
    • When the accelerator is done crunching, it issues an interrupt request to the host CPU
    • The host CPU responds to the IRQ and retrieves data from the accelerator via software or through a DMA Controller or IOMMU
    • This sequence continues until all data is processed
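
    Here is a host-side sketch of that sequence in C. The acc_* functions are hypothetical placeholders for a real accelerator driver API (DMA in, start, wait for the completion interrupt, DMA out); they are stubbed out with a toy "double every element" kernel so the sketch compiles and runs on its own.

        #include <stdio.h>
        #include <string.h>

        #define CHUNK 1024

        static double device_buf[CHUNK];   /* stand-in for accelerator memory */

        static void acc_dma_to_device(const double *src, size_t n)
        { memcpy(device_buf, src, n * sizeof *src); }   /* stub for a DMA transfer in */

        static void acc_start(void)
        { for (size_t i = 0; i < CHUNK; i++) device_buf[i] *= 2.0; } /* stub: stands in for the device crunching the data */

        static void acc_wait_irq(void) { /* stub: block until the completion IRQ */ }

        static void acc_dma_from_device(double *dst, size_t n)
        { memcpy(dst, device_buf, n * sizeof *dst); }   /* stub for a DMA transfer out */

        int main(void)
        {
            double in[CHUNK], out[CHUNK];
            for (size_t i = 0; i < CHUNK; i++) in[i] = (double)i;

            acc_dma_to_device(in, CHUNK);     /* move input data to the accelerator   */
            acc_start();                      /* tell the accelerator what to do      */
            /* the host could prepare the next chunk of data here                     */
            acc_wait_irq();                   /* accelerator signals completion       */
            acc_dma_from_device(out, CHUNK);  /* retrieve the results                 */

            printf("out[10] = %.1f\n", out[10]);   /* 20.0 for this toy kernel */
            return 0;
        }

    This administration (DMA setup, kick-off, interrupt handling) is the overhead that determines how coarse the offloaded tasks have to be.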

    To accelerate the execution of instructions, nearly all accelerators are created from combinatorial logic, i.e., they do not use a multi-stage pipeline like processors do. An accelerator built from combinatorial logic is effectively a fixed-function device: it will use a lot less energy to execute a particular task than a programmable processor, and it will usually do so in less time, but it is of course less flexible. It will execute the function it was designed for, but it is not programmable. In other words, if such an accelerator is designed to execute function xyz, and xyz is deprecated, then it cannot be used to execute function abc instead.

    That is one of the reasons why most accelerators focus on optimizing the execution of well-settled functions that do not change. Examples are matrix math, tensor math, finite element analysis, any kind of transforms such as Fourier Transforms. There are a few more computational problems that have not changed in a long time, and most cryptographic functions fall into that category. However, with the advent of quantum computing it is unclear if AES, SHA-2 and SHA-3 will have a future, and as such, they fall into a category that is not actively pursued much outside of cryptocurrency mining and blockchain verification. Cryptanalysis will continue to be of interest to a lot of organizations.

    Looking at the above list also makes one other implication very clear. There is a considerable amount of administration required to farm out a computational task. As a result, only tasks that are fairly lengthy to execute on a general-purpose CPU should be offloaded to an accelerator. If the administration of farming out takes 100 cycles, then the task that is farmed out should save at least 1,000 cycles, or its benefits are drastically diminished. As a result, we see mostly fairly coarse granularity of farmed-out tasks.

    How an Interview can go wrong

    Posted on by: Axel Kloth

    I have been interviewed many times in my career. The vast majority of journalists get it right, send you a draft upfront to review and edit if needed, and then publish the article with the quotes. However, sometimes it goes terribly wrong, so be careful who interviews you, and who quotes you. I have included one example that still irks me today, as the alleged quote is not only factually incorrect, but more importantly, I have never been interviewed by the person who claims this quote.

    Here is what I told a journalist who interviewed me. This is the reviewed and agreed-upon text:

    If you receive a postcard, ask yourself what part of the data on the postcard is correct and trustworthy. Usually, on a postcard you'll find your (the recipient's) address, some text or advertisement from someone, and possibly a sender's address. It might even have a stamp. Out of all this data, the only information you can trust to be reasonably correct is your address. If it were not, you would not have received the postcard. The text on the postcard can be fully made up and thus cannot be trusted. The same is true for the sender's address and name. Neither one of those pieces of data has to be present or correct for the postcard to arrive at your address. As such, you can't trust much of the data on a postcard. The situation is not much better with a letter. If inside the letter there is a verbatim copy (or even better, an encrypted version of the sender and receiver data based on a pre-shared password) of the recipient's and the sender's address, it gives you a little more confidence that these pieces of data are correct, but not much. After all, envelopes can easily be opened and re-sealed. The equivalent assumption of confidentiality, correctness and authenticity can be made for any kind of unencrypted communication over the Internet.

    As far as I know, the agreed-upon article was never published. Somehow, the interview text made it to someone working for an Indian newspaper, and that person not only butchered the text but got it entirely wrong. This is what was published by a person who never talked to me, including a quote that I never gave:

    "Say you receive an envelope and on that envelope is your name and address, a return name and address and a postmark," he explained. "You can authenticate the recipient with surety. If you know the person on the return address you might know who isn’t it, and the postmark gives you some idea that the government has properly delivered it. But you do not know if the envelope contains a letter or anthrax powder. If you cannot authenticate each part of the delivery mechanism you don’t have security."

    As you can see, the alleged quote has nothing to do with what I said. It is factually incorrect, it adds items that I have never said, and a style analysis reveals that in fact this is not in line with any of my other statements or posts. The problem is of course that this entirely made-up "quote" is out there, and there is nothing I can do to make the journalist retract it.

    HPC, System Uptime and Security

    Posted on by: Axel Kloth

    There is one recurring question I get, and that is “Why if you are an HPC company do you care about or post on security and system availability/uptime issues?”

    Let me explain why. First of all, we are not a security processor company. Second, we believe that security and hardening measures must be included in all new hardware, firmware, operating systems and application software. Third, we see that with 5G and Fiber To The Home (FTTH) and Fiber To The Curb (FTTC) the attack surface for the Internet backend has grown and will continue to grow, and the available bandwidth for attacks to take out large portions of the Internet infrastructure has increased to a degree that the so-called “edge” will have to take a role in protecting the backend.

    The magical “edge” is the collection of curbside compute that allows for 5G and FTTH and FTTC to be viable and useful. Edge computing reduces latency to the user as some of the preprocessing is executed there, and low latency is crucial for a wide variety of applications – gaming and Advanced Driver Assistance Systems (ADAS) as well as Vehicle-to-anything (V2X) included. While most compute will still occur in the backend, the edge will become more powerful, but unlike the backend, the edge is not physically protected. As a result, any attacker will have relatively easy access to the compute power at the edge. Because of the compute performance and the bandwidth of edge devices, an attacker can wreak havoc on the Internet backend, including supercomputers. Therefore, we believe that the edge must be able to protect itself and the backend from attackers as much as possible. A botnet of edge devices will bring down the Internet.

    We have seen the tactics of attackers change. We went from Denial-of-Service (DoS) and distributed Denial-of-Service (DDoS) attacks to attacks against the DNS system, and while the Internet did not go down, many users experienced it as such. While we have our own DNS servers, and we were able to continue to work without interruption, many other users were not. With more bandwidth and compute power at the edge, an attacker has a much higher chance to disrupt Internet access and Internet traffic.

    That is why we focus on system availability, including the security measures we need to take to protect systems built with our processors. If a system is under attack, if its incoming links are unavailable, or if it has been forced to reboot or is in a rolling recovery situation, it is not available for its intended tasks. All of this counts towards system downtime, and HPC time is expensive and precious. As a result, we try to make sure that systems built with our processors have the hardware and firmware robustness needed to withstand an onslaught of attacks from compromised edge devices, and ideally, we'd work with those edge devices to avert a situation in which a large number of edge devices is compromised in the first place.

    Because of its enormous compute power and bandwidth, we need to make sure that a supercomputer cannot be compromised. Any computer is most vulnerable to attacks during boot time, when most hardware and software security mechanisms are not yet active. We are working on solutions to plug those holes. We are also working on a more active firewall. For example, if someone pings us, we do not respond. If from the same IP address we then see a netscan, that event is logged, but today manual intervention is needed to block that IP address. That does not make sense - the firewall should block that IP address after one ping (or traceroute) and one netscan attempt all by itself.
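
    As an illustration of the ping-then-scan policy described above, here is a minimal sketch of the decision logic. It is deliberately simplified - a real firewall would track state per source address in a hash table with aging, and the event types here are assumptions on my part.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified per-source-address state for the "one probe + one scan -> block" rule. */
typedef enum { EV_PING, EV_TRACEROUTE, EV_NETSCAN } event_t;

typedef struct {
    uint32_t src_ip;        /* IPv4 source address (host byte order)      */
    bool     probed;        /* saw a ping or traceroute from this address */
    bool     blocked;       /* drop all further traffic from this address */
} src_state_t;

/* Returns true if the source should now be blocked automatically,
 * without waiting for manual intervention. */
bool on_event(src_state_t *s, event_t ev)
{
    if (s->blocked)
        return true;
    if (ev == EV_PING || ev == EV_TRACEROUTE) {
        s->probed = true;               /* log the probe, do not respond   */
    } else if (ev == EV_NETSCAN && s->probed) {
        s->blocked = true;              /* probe followed by a scan: block */
    }
    return s->blocked;
}
```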

    Aside from those reasons, I am as annoyed as anyone else if the Internet is not available or upload or download speeds are abysmal – even at home.

    HPC as a Service

    Posted on by: Axel Kloth

    It was only a matter of time until someone would announce and implement HPC as a Service (in the Cloud, of course).

    Fundamentally, a supercomputer today is the same as any data center operated by the hyperscalers and all of the Cloud providers. The infrastructure in both cases consists of Commercial Off-The Shelf (COTS) x86-64 servers, connected via COTS top-of-the-rack switches and a COTS global switch connecting the top-of-the-rack switches, plus an equivalent storage subsystem, and some of the general-purpose CPUs may be accelerated via General-Purpose Graphics Processing Units (GPGPUs).

    The only difference is in the interconnect, which in the case of the hyperscalers usually is 10 Gbit/s Ethernet, whereas the supercomputer faction typically uses lower-latency and higher-bandwidth InfiniBand with DDR or QDR data rates of 20 or 40 Gbit/s. As a consequence, it does not take much to change a portion of a hyperscaler's data center over to support both Ethernet and InfiniBand, and sell the CPU hours at a much higher price to HPC users compared to normal users.

    Here at ZDNet is one of the stories about HPC as a Service.

    Intel, GlobalFoundries and SiFive

    Posted on by: Axel Kloth

    Intel has been in the news a lot lately. That is largely due to Pat Gelsinger rejoining Intel. I have the utmost respect for Pat, and I know that he is up to quite a challenge.

    I don't know how to say this any friendlier, but at this point in time Intel is a half-trick pony. Intel has x86-64, which is slowly but surely running out of steam, and a few failed expensive acquisitions that don't play well together. MobilEye and Altera really don't complement each other, and I do not see how Nervana and Habana could augment each other's offerings. Also, Intel just discontinued Itanium, and while that was a good and overdue step, it reduces the options in processor ISAs that Intel has available to itself.

    Then there is the fact that in all large organizations complacency sets in. In the beginning, every startup will attract people who want to do something exciting and new, and they work their butts off. Once a company is established and generates profits, it will attract a different kind of applicant. They are usually seat warmers, and their only purpose is to collect a paycheck. They are the status-quo people who object to any change. They justify their behavior to themselves and to others by stating that someone has to keep a straight course. They know or at least suspect that their behavior might sink the ship, but then do the math and quickly come to the conclusion that the 15 years they have left until retirement are not enough time for the mothership to go belly up.

    Pat will have to clean out the house. I assume that in due time we will see a RIF to get rid of people who do not contribute. He is in a much better position to identify those than the prior CEOs, who were deeply non-technical. He will need chip manufacturing capacity, and that is where the GloFo acquisition comes in handy. He needs a different processor architecture, as it is certain that x86-64 will run out of steam. That is where SiFive fits in, and I had alluded to that earlier here: Intel interested in SiFive. GloFo will also help Intel become a more customer-centric organization, and while GloFo itself is not the best at that, it will be a wakeup call.

    Intel will be able to shift production around: datacenter CPUs and accelerators on its newest nodes, and all supporting and peripheral chips in the older fabs from Intel and in the newly acquired GloFo fabs. Then slowly build out the ex-GloFo fabs, bring them into the FinFET and GAA age, and reap the benefits of using the tools that external customers use inside Intel as well. I suspect that at this point in time, Intel spends a fortune on tools developed in-house for the design and the DV (design verification) of its processors and peripherals. If Intel can switch over to commercially available tools from Synopsys, Cadence and Siemens Software (ex Mentor) as well as from AnSys, that would save tons of money, and it would allow Intel to attract design engineers from the outside and get them productive without retraining on its internal tools.

    Intel has discontinued Itanium

    Posted on by: Axel Kloth

    Over 20 years ago, Intel introduced a non-x86 processor, the Itanium. This was a collaboration between HP and Intel for a successor to the Intel x86 architecture. HP contributed what was HP PA (Precision Architecture), and Intel added EPIC (Explicit Parallel Instruction Computing). EPIC was a departure from traditional CPU design as it relegated a lot of the work of parallelizing instructions to the compiler instead of hardware within the CPU. Intel promised improved performance over the traditional design philosophies, and positioned Itanium above anything x86. Unfortunately, those promises never materialized. Itanium was never able to surpass x86 in performance in emulated or in native mode, and even native Itanium applications were never able to drastically outperform any competitor. Part of the problem was the complexity of the compiler, and Intel never managed to create a compiler that was fully able to extract the theoretical benefits of EPIC.

    While Intel had announced the discontinuation of Itanium years ago, its eventual demise came quietly, and it would have gone mostly unnoticed had publications such as TechSpot not announced it. Most comments were not exactly kind to Itanium, as we can see here: Intel's Itanium is finally dead. To a large degree, that criticism is well-deserved, but Intel certainly deserves some credit for introducing a new processor architecture in which parallelization was done in an unconventional way.

    While some put the blame on the lack of compatibility with legacy x86 (yes, the 32-bit version, not x86-64) software, I don't think that this is correct. Had Itanium shown better performance in native mode over any other processor or over the emulated x86 code, then I am fairly certain that Itanium would have succeeded despite using a different ISA. But it never did. I think that both the price point and the performance (or lack thereof) had to do with its demise. In the end, only HPE used Itanium - and for a good reason.

    Rest in Peace, Itanium, and hopefully with EPIC at your side. We will bury you next to SPARC.

    HPC applications with impact

    Posted on by: Axel Kloth

    HPC is not an easy topic to explain. I frequently encounter people asking me what I do. Here is a brilliant article outlining what HPC has been able to do for humankind lately, and that article focuses on only 6 applications that have saved lives and made our world a better place.

    GlobalFoundries going IPO

    Posted on by: Axel Kloth

    In the past few weeks, there were lots of news about GlobalFoundries. One rumor stated it was going to be acquired by Intel. The Wall Street Journal reported on it here: Intel Is in Talks to Buy GlobalFoundries for About $30 Billion. Forbes tried to find a good reason for this acquisition in Intel’s Possible Rationale For Buying GlobalFoundries, Inc., and even The Next Platform was not quite sure if the idea was that great. Another Crazy Idea: Intel Might Buy Globalfoundries is not exactly an endorsement.

    GlobalFoundries' CEO Tom Caulfield responded nearly instantaneously here: CEO of ex-AMD fab GlobalFoundries shoots down Intel buyout, and stated that the company was planning an initial public offering (IPO). My opinion is that, very simply, GloFo is not set up for an IPO. GlobalFoundries was created by spinning off the former AMD chip manufacturing unit, and over time it added more and more fabs to the portfolio, including the former IBM Microelectronics. They all used specialty processes, and they had to be integrated with each other. In the process of doing so, GlobalFoundries lost sight of the need to perpetually improve. It was left with the embarrassing acknowledgment that it would not pursue any process nodes beyond 10 nm. While that is a FinFET process, and it has licensed Silicon-on-Insulator (SoI) from ST Micro (and later on from Samsung), it is not a leading supplier for the manufacturing of processors and accelerators. In essence, it now has planar transistor processes, one or two SoI processes, and one or two different FinFET processes. I do not see how it can support its customers with Process Development Kits (PDKs) at the current level of revenue. Simply put, trailing process nodes command commodity pricing.

    Hyperconverged Servers

    Posted on by: Axel Kloth

    Everything old is new again. The first supercomputers were machines that were very different and distinct from any regular computer. They were different from minicomputers, mainframes and from PCs. At that time, internal interconnects were vastly faster (bandwidth and latency) than anything a network could offer, so compute and storage were in the same machines. A supercomputer would contain not only the compute subsystem, but also the storage subsystem. The same applied to minicomputers and PCs.

    Then it turned out that more bang for the buck could be had by not building dedicated, special-purpose supercomputers. Instead, industry-standard servers were used to create a supercomputer. They were just regular PC-based servers connected via the fastest interconnects that could be had. That worked out well for a while, but disks were very slow, and tapes even slower. Since data management was needed, and that alone was computationally moderately intensive, storage management was put as software onto industry-standard servers to take over all storage management tasks, and so the storage appliance was born. This allowed supercomputers to be logically divided into two different classes of computers: the compute clusters and the storage clusters. Compute clusters focused on compute (i.e., lots of CPUs and accelerators and memory), and the storage appliances in the storage clusters took over all storage tasks, including snapshots, deduplication, tape operations, backup and restore and of course disk caching. Doing so cut down on cost while improving performance.

    It worked well for as long as the network was faster than cached disk I/O. The advent of Flash in the form of SAS- or SATA-attached SSDs started to change this. PCIe-attached storage provided a level of performance that network-attached storage simply could not match any more. PCIe Gen3 in the 16-lane variant tops out at less than 16 GB/s, and that is the ceiling both for PCIe-attached Flash and for any network-attached storage that has to come in through a PCIe-attached network adapter. As a result, all high-performance storage was pulled back into the compute nodes, and only nearline and offline data storage on SATA disks and tape as a backup is now left on storage appliances. In essence, what used to be a performance-enhancing technology has now become simply a bulk storage and backup/restore technology, possibly with features for deduplication. This has caused a simple convergence of compute and performance-oriented storage, creating the hyperconverged server.
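
    The "less than 16 GB/s" figure for PCIe Gen3 x16 follows directly from the link parameters: 8 GT/s per lane with 128b/130b line encoding. A quick calculation (before protocol overhead such as TLP headers, which lowers the usable number further):

```c
#include <stdio.h>

int main(void)
{
    /* PCIe Gen3: 8 GT/s per lane, 128b/130b line encoding. */
    double gts_per_lane = 8.0;               /* gigatransfers per second  */
    double encoding     = 128.0 / 130.0;     /* payload bits per line bit */
    int    lanes        = 16;

    double gbits  = gts_per_lane * encoding * lanes;  /* usable Gbit/s, one direction */
    double gbytes = gbits / 8.0;                      /* ~15.75 GB/s per direction    */

    printf("PCIe Gen3 x16: %.2f Gbit/s = %.2f GB/s per direction (before TLP overhead)\n",
           gbits, gbytes);
    return 0;
}
```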

    High-Performance RISC-V Cores

    Posted on by: Axel Kloth

    Multiple media organizations including Heise in Germany have reported that China (more specifically the Institute of Computing Technology at the Chinese Academy of Sciences or ICT CAS) has built a high-performance RISC-V core. Even The Register reported on this, under the headline “Chinese chip designers hope to topple Arm's Cortex-A76 with XiangShan RISC-V design”.

    The processor is named XiangShan, and ICT CAS faculty have posted its entire Chisel/Scala source code on GitHub. There is also a fair amount of documentation, some of it in English, and a few schematics and block diagrams that go with the basic architecture. I have not had time to synthesize the processor and verify the performance (that will take me a while anyway), but from the text and the schematics the performance claims make sense. It is an out-of-order design. The pipeline depth of 11 stages seems to be confirmed by the documentation and the schematics, and so is its six-issue width. It seems to rely on 4 DDR DRAM Controllers, and I am still trying to find out if the DRAM Controllers are part of the design. In the original SiFive and UC Berkeley designs, the DRAM Controller is not included.

    Without knowing anything about the DDR DRAM Controllers, it is hard to estimate what the performance is going to be. After all, most modern CPUs outperform their memories, and therefore the caching strategies (Cache types and hierarchies and Cache Controllers) and the DRAM Controllers are an integral part of a processor's performance.

    If the DRAM Controllers are not part of the design, then commercially available DRAM Controllers from Synopsys or Rambus (ex NorthWest Logic) can be used. However, that takes away part of the open source design, as without the DRAM Controllers, the chip cannot be finished. I am also still looking for the PCIe Gen3 Controller with the 4-lane interface, and I have not yet found it in the source code. Having just finished the design of a PCIe Gen3 Endpoint Controller with a 4-lane interface, I can say that this is not a trivial undertaking. But again, if this is not part of the open source release and has to be sourced from a commercial entity, then most core components of this chip are open sourced, but not the whole design.

    Either way, if ICT CAS did in fact tape this chip out at TSMC on a 28 nm node, then a maximum clock frequency of 1.2 to 1.3 GHz is believable. If that was the target node and if ICT CAS used the Synopsys DDR4 DRAM Controllers, then I have a hard time believing that this processor achieves the ARM Cortex-A76 level of performance. Considering the more complex ISA of the ARM processor with more compound instructions and its higher clock frequency, I'd think that the XiangShan processor is probably at 60 - 70% of the performance level of the Cortex-A76. It nevertheless is an extraordinary achievement by the ICT CAS team. I do not want to take away anything from their impressive achievements, and considering the learning curve, I'd wager that it won't be long before they outperform the ARM family in per-core and per-processor performance metrics. However, like ARM, they have not solved the underlying problem of scalability.

    Looking up the meaning of Xiang Shan comes back as Fragrance Hill, a hill near Beijing. I am not sure if there is a deeper meaning behind this code name other than mimicking Intel’s use of lakes as code names for their processors.

    Right to Repair and Privacy

    Posted on by: Axel Kloth

    It looks like Steve Wozniak has commented on the Right to Repair. When Steve talks, I usually listen. I agree that he has a point, as the repair of a device is oftentimes economically and ecologically advantageous. It does not make sense to throw away a device that is repairable when a non-crucial component is damaged. The question then is what counts as a non-crucial component, and when a device should be deemed non-repairable because the repair has an impact on whether we trust the repaired device to keep secrets the same way that the original device did.

    After all, for many of us, the phone has become the holder of all of our secrets – bank accounts, passwords, social security number, cashless payment system info, birthdays, fingerprint or iris scans and so on. We inherently trust Apple and Samsung, Qualcomm and a few other companies that supply the mainstream phone manufacturers because they have an established supply chain that is thoroughly vetted and continually monitored for compliance with applicable laws and additional policies imposed by the phone manufacturer.

    Let me use an analogy to highlight where the issue lies. Let's say your Mercedes is involved in a fender-bender. Only the right front fender, which is bolted on and has no impact whatsoever on anything in a subsequent crash, is damaged. You would not throw away the car because the fender is bent. You'd repair the car, and you might choose to use an aftermarket fender that is cheaper. While it may rust through ten years down the road, possibly earlier than an OEM part would have, there is no impact on the safety of the car itself, its value, or its longevity. If a crucial part of the car is damaged, such as the components that make up the safety cell, the car is usually deemed non-repairable because in a subsequent accident, it might not protect you.

    The same applies to the phone. If a socket, a switch or the display gets damaged, repairs are and should always be possible and feasible. The same goes for the battery.

    Now if the CPU or the security processor needs to be replaced, I would deem that non-repairable by a 3rd party. Why? As stated above, Apple and Samsung and other established phone manufacturers have a supply chain that is well controlled. If you bring your phone with the dead CPU or security processor to shadytreephonerepairs.com, which sources the security processor from insecurityprocessors.com and the firmware for it – to resemble something like a working security processor – comes from backdoorfirmware.com, then I’d question if the prior level of trust in the phone to keep things private can be re-established.

    If trusted and trustworthy 3rd party repair companies exist that have a supply chain that is similar to the OEMs, then that is a different story, but history has shown that that is not always the case. The danger of knowingly or unknowingly using counterfeit parts always exists if the supply chain is not clear.

    Louis Rossmann, who runs the non-profit Repair Preservation Group Action Fund, has shown in many of his videos on YouTube that manufacturers' repairs (or attempts thereof) are not necessarily always successful or even state of the art, but at least they use the same components as the OEM, coming from the same supply chain.

    What is my take? I support the Right to Repair movement and think it makes sense. However, some provisions have to be made to either limit that right if crucial components are affected, or to help establish 3rd party repair companies with a supply chain and associated quality standards and a guarantee from that company to the user that OEM parts are used. Those limits must be made very clear so that there is no confusion over the components that are excluded from the Right to Repair bill. For example, everything that is on an export control list must be excluded from the Right to Repair provisions. Enforcing compliance with these regulations will not be trivial. I suspect that the laws encompassing the Right to Repair will have to be very strict and detailed, and that they will have to reference or include many other and secondary laws. As such, making the repair industry compliant could be the economic nail in the coffin for many devices that were initially targeted by the proposed Right to Repair laws.

    VPNs and TOR – a Security Assessment

    Posted on by: Axel Kloth

    As a result of my postings with regards to the supply chain attacks against SolarWinds and Kaseya I have been asked what I think of VPNs (Virtual Private Networks) to fend off some of the threats. We will analyze that.

    I think the term VPN has been misused lately, so I’d like to clear up what it is, and how the different types of VPNs compare. While we are at it, I’ll throw in TOR as well.

    VPNs have been around for a while, and they protect data-in-transit. The traditional understanding of VPNs is that they span a Virtual Private Network across the planet, and they use the Internet as a transport medium. A VPN will "tunnel" through the Internet. It does that by encrypting the traffic between the endpoints after having authenticated them. Companies started using VPNs as a means to connect remote workers to a central office. These types of VPNs worked well as they made the contents of the VPN that tunneled through the Internet invisible to any attacker. An attacker was only able to identify the two endpoints of the communication, but it was and is impossible to see the payload (the contents of the communication). The VPN endpoint usually was the employee's computer or a VPN-capable firewall at the employee's home office. It did not matter where on the planet the employee was, he or she could log into the company's network, computers and servers as if he or she were in the office.

    A very similar type of VPN was used between two or more branch offices of a company. That made it possible to transmit secret messages between branch offices and remote workers without the risk of someone snooping or inserting unauthorized contents. In those applications, firewalls were deployed that acted as VPN termination points. Each branch office had to use VPN firewalls to participate in the secure communication. They had to be set up properly with the encryption type, a pre-shared key, and the appropriate methods for authentication (for both the phase 1 and phase 2 negotiations).

    It is important to mention that both of these types of VPNs are end-to-end connections with end-to-end encryption. Typically they use a protocol called IPSec, which allows for a persistent connection. This is in contrast to SSL and TLS, which are transient connections and tunnels.

    To my knowledge, VPNs were never successfully breached for as long as they were set up properly. The problem with these VPNs was largely that they were difficult to set up, and that interoperability was not great. Cisco gear did not want to communicate with Juniper gear, and others did not fare much better either. The biggest challenge though was the implicit assumption that all endpoints were secured and free of malware. In essence, the VPN treated all endpoints as if they were local resources on the internal company LAN. That assumption was oftentimes proven wrong. Particularly on computers that were bought and administered by the employees ("BYOD", or Bring Your Own Device policies), the trust put into these devices was misplaced. Any malware and worm that was capable of spreading within a LAN was capable of traversing the VPN, and so VPN-attached endpoints had to be considered about as safe as any device in a DMZ or De-Militarized Zone. As a result, malware scanners in the firewalls became necessary for all VPN endpoints. That additional policing and filtering took away many benefits of the VPNs, and as a result, VPNs were abandoned by many companies.

    We would advocate for the continued use of VPNs if possible, particularly from a security and performance standpoint. As a result, we have implemented a smart NIC in our Server-on-a-Chip that executes all necessary functions for packet filtering, encryption and decryption as well as authentication independent of the application processor cores. To make it clear, VPNs and packet filtering alone will not stop infiltration or exfiltration of data caused by supply chain attacks as those are directed against the server infrastructure itself, but with VPNs and filtering and authentication as well as with code signing the likelihood of another supply chain attack against an IT infrastructure software provider can be minimized.

    The term VPN has been taken up by a new crop of companies promising better Internet security. Let's see if that holds up to scrutiny. Most Internet users are connected to the Internet through their Internet Service Provider, or ISP. This can be DSL if your carrier uses phone lines, cable if your ISP uses coax cable for TV and data, or glass fiber if your provider uses Fiber to the Home (FTTH) or Fiber to the Curb (FTTC) and a secondary fiber from the curbside unit to the home. In any of these cases, all of your traffic goes through your ISP. Your ISP can therefore monitor and observe all your traffic, including but not limited to your DNS (Domain Name System) traffic.

    The DNS infrastructure is a very important part of the Internet, and it resolves the symbolic name you type into your browser's address bar into an address that networking devices understand. While you type NYT.COM, your computer does not know what that is. Neither does your Internet access device, but it will ask the nearest or pre-configured DNS server what that is in IPv4 or IPv6 language. The DNS server will resolve this and report back 199.181.173.179, which is a public IPv4 address. Your computer's browser can now establish a connection with NYT.COM under that IP address. Since your ISP has access to all of this traffic, your ISP knows which web sites you visit. If some of them are less than clean, your ISP will know that.

    VPN services have understood that and offer a solution to this. If you sign up with them, they will send you credentials for a VPN endpoint setup, and if you configure your PC, wireless router or LAN router with these credentials, then the traffic between you and their ingress router will be encrypted. In other words, they obscure your traffic and the ISP cannot monitor it. However, since your VPN is terminated at the ingress side of the VPN service's router, and since your ISP cannot provide DNS services to you, the VPN provider will resolve all DNS queries, and then forward and route your traffic accordingly. In simple terms, your VPN provider now has the technical ability and in some jurisdictions the mandate to collect your traffic and your metadata (your IP address and the IP addresses of all of the web sites you visit). In reality, you have not gained any privacy or security. You have shifted the data collection from your ISP to your VPN provider.
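
    The name resolution step described above is easy to demonstrate: the sketch below asks the configured resolver (normally your ISP's or your VPN provider's DNS server) for the addresses behind a hostname, using the standard getaddrinfo() call. Whoever operates that resolver sees the query; the hostname and port are just example values.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
#include <arpa/inet.h>

/* Resolve a hostname the same way a browser does before it can connect. */
int main(void)
{
    const char *host = "nyt.com";             /* example hostname */
    struct addrinfo hints, *res, *p;

    memset(&hints, 0, sizeof hints);
    hints.ai_family   = AF_UNSPEC;            /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(host, "443", &hints, &res) != 0) {
        fprintf(stderr, "resolution failed\n");
        return 1;
    }
    for (p = res; p != NULL; p = p->ai_next) {
        char buf[INET6_ADDRSTRLEN];
        void *addr = (p->ai_family == AF_INET)
            ? (void *)&((struct sockaddr_in *)p->ai_addr)->sin_addr
            : (void *)&((struct sockaddr_in6 *)p->ai_addr)->sin6_addr;
        inet_ntop(p->ai_family, addr, buf, sizeof buf);
        printf("%s resolves to %s\n", host, buf);   /* the resolver saw this query */
    }
    freeaddrinfo(res);
    return 0;
}
```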

    This is very similar to TOR, short for The Onion Router. TOR was and is a special version of the corporate version of Firefox. TOR searches for other TOR users that have published their involvement and participation in a secret database, and it will then use a number of other TOR users as if they were a VPN service provider, while making sure that the traffic between you and the next TOR user is encrypted. Your traffic will in essence go through your ISP fully encrypted, and it will use other TOR users as entry and exit points into and out of the Internet. TOR actually uses multiple hierarchies of VPN tunnels so that TOR users and TOR exit points are not easily identifiable and traceable, but it is not 100% bulletproof, as an analysis of delays can render enough data on how many layers of encryption and how many TOR users are in the chain to ultimately crack it and pinpoint the true endpoint of the traffic.
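
    The layering that TOR applies can be sketched as follows. encrypt_for() is a hypothetical placeholder for an authenticated encryption step; the point is only to show that the payload is wrapped once per hop, so that every relay can peel off exactly one layer and only ever learns its predecessor and successor.

```c
#include <stddef.h>

#define MAX_HOPS 3

/* Hypothetical primitive: encrypt 'len' bytes in 'buf' (which must be large
 * enough for the added overhead) for the given hop's key, returning the new
 * length. Stands in for a real AEAD/public-key scheme. */
size_t encrypt_for(const char *hop_key, unsigned char *buf, size_t len);

/* Onion-wrap a message: encrypt for the exit hop first, then for each hop
 * closer to the sender. The innermost layer belongs to the exit hop, the
 * outermost layer to the first relay the sender contacts. */
size_t onion_wrap(const char *hop_keys[MAX_HOPS], unsigned char *buf, size_t len)
{
    for (int hop = MAX_HOPS - 1; hop >= 0; hop--)
        len = encrypt_for(hop_keys[hop], buf, len);
    return len;
}
```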

    TOR was originally developed by the CIA to allow whistleblowers all around the world to send secure messages to the CIA without exposing the source to anyone outside of the CIA. An analysis of the code of TOR revealed a whole lot of measures that were taken to protect the whistleblower, but also plenty of vulnerabilities of the concept, including a phone-home provision. As one would expect, the author of this used TOR after having removed the phone-home provision, and experimented with the browser and its concealment provisions. Today, TOR is mostly used to access the Dark Web. That is a theme for a different discussion and blog post.

    Yet another Ransomware Attack

    Posted on by: Axel Kloth

    It seems like there is no letting up on ransomware attacks. Axios reports that Kaseya hackers demand $70 million in massive ransomware attack. This appears to be yet another supply chain attack. A supply chain attack is an attack that is directed at a company that provides IT infrastructure management tools. A fraudulent piece of software is inserted into this management tool or toolset. Usually, this piece opens up a backdoor by which the attackers can access every user of the management toolset. Since the tool has administrative privilege, it can do anything it wants. In the case of ransomware attacks, the attackers encrypt the contents of the server(s) to a key pair of which only they know the private key. If the attacks are noticed in time (i.e., before the last valid and unencrypted backup is overwritten by an encrypted one), then all it takes is to disable the IT management tool and restore the most recent valid backup. In that case, only the data that was generated between the most recent valid backup and the time of the breach notification is lost. Restoring a backup also takes quite some time, so the business will be interrupted for a while. This is not a desirable situation to be in, but it is better than having to pay a ransom and hope that the criminals behind the attack release a valid decryption key. In my opinion, paying a ransom should be made illegal to avoid making ransomware attackers commercially successful.

    We also need to think about additional levels of administrative privilege. Today, there is the admin (or root) and the ordinary user. This needs to be amended. There need to be admins, and above them super-admins that administer certain rights, enable and disable admins, and install firmware updates and updates to the OS kernel. Admins should then be restricted to administering users and applications above the OS level. Users should only have the right to store and retrieve data and use applications installed by admins. Very limited privileges for code execution should be granted. Limited users should only be able to store and retrieve data, without any rights to initiate any code execution. In those scenarios, IT management tools only have admin privilege, and so bulk encryption cannot be initiated. For firmware and OS kernel updates and changes, the tools must require elevation to super-admin privilege levels, and that will require additional authentication. As an industry, we might also have to rethink hardware, firmware and OS kernel security. Ultimately, we might also have to rethink the current data-in-transit and data-at-rest paradigm.
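
    A sketch of the tiered privilege model proposed above could look like this. The tier names and the operation list are my own illustration, not an existing OS facility; the point is that bulk actions such as firmware or kernel updates require a higher tier than the one an IT management tool runs at.

```c
#include <stdbool.h>

/* Proposed privilege tiers, from most to least powerful. */
typedef enum {
    TIER_SUPER_ADMIN,   /* firmware and OS kernel updates, enables/disables admins */
    TIER_ADMIN,         /* users and applications above the OS level               */
    TIER_USER,          /* store/retrieve data, run installed applications         */
    TIER_LIMITED_USER   /* store/retrieve data only, no code execution             */
} tier_t;

typedef enum {
    OP_FIRMWARE_UPDATE, OP_KERNEL_UPDATE, OP_MANAGE_ADMINS,
    OP_INSTALL_APP, OP_MANAGE_USERS,
    OP_RUN_APP, OP_STORE_DATA
} op_t;

/* Minimum tier required for each operation. */
static tier_t required_tier(op_t op)
{
    switch (op) {
    case OP_FIRMWARE_UPDATE:
    case OP_KERNEL_UPDATE:
    case OP_MANAGE_ADMINS:  return TIER_SUPER_ADMIN;
    case OP_INSTALL_APP:
    case OP_MANAGE_USERS:   return TIER_ADMIN;
    case OP_RUN_APP:        return TIER_USER;
    default:                return TIER_LIMITED_USER;
    }
}

/* An IT management tool running at TIER_ADMIN cannot, for example, push a
 * firmware update without explicit re-authentication to TIER_SUPER_ADMIN. */
bool is_allowed(tier_t caller, op_t op)
{
    return caller <= required_tier(op);   /* lower enum value = more privilege */
}
```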

    Here is a scenario I’d like you to consider.

    Imagine that you need to transport large amounts of cash from A to B. Obviously, since cash has value, you’d like to protect the transport. So you build a transport van that protects the payload. The drivers are in an unprotected cabin. Does that make sense to you?

    Similarly, once you arrive at the bank, the vault is pretty safe, but the tellers are unprotected. Does that make sense to you?

    Well, that is exactly what we see in Internet security. There is a focus on data-at-rest and on data-in-transit. What’s not protected is the whole set of devices storing the data (even though the data may be on disks that are encrypted) and the whole set of devices that transport the data from A to B. In other words, Internet security focuses on protecting the payload. It does not make sure that the devices that transport and store the data are equally well safeguarded.

    We can see the impact of those decisions. Data gets stolen or "scraped", breaches occur, and data is lost in attacks that encrypt your data and hold it ransom. Would you not think it is time to protect the devices that store and transport your data just as well as the payload?

    That is what Abacus Semiconductor is doing. Besides being a high-performance solution with our Server-on-a-Chip and our intelligent memory subsystem, we protect the device that protects the data-in-transit and the data-at-rest. With those measures in place, we might be able to stem the tide of ransomware attacks.

    The USA turned 245 years old (or young)

    Posted on by: Axel Kloth

    This Fourth of July marks the 245th birthday of the United States. It gained its independence from the British in 1776. In five years, it will turn 250 years old. Not all democracies in the history of humankind have survived nearly a quarter of a millennium. Let's hope that the US democracy is going to be in a much stronger and better position then.

    Intel interested in SiFive

    Posted on by: Axel Kloth

    It looks like Intel has submitted a bid for SiFive, the commercialization entity for the RISC-V processor and ISA. According to Tom's Hardware and Bloomberg, Intel Offers $2 Billion for RISC-V Chip Startup SiFive.

    This would be a validation not only of RISC-V, but also of the ecosystem around it. I doubt that Intel would make that acquisition to shut down RISC-V as SiFive is not the RISC-V steering committee but merely the commercialization branch of RISC-V. This appears like a genuine approach of Intel to diversify into non-x86-64 processors. While SiFive has a focus on embedded systems and therefore can help Intel fend off ARM in those areas, I believe that this might signal the end of x86-64, and now puts pressure on AMD. This was a brilliant move, because as I have mentioned in my blog multiple times (FTC opens probe into nVidia and ARM merger, nVidia and ARM merger hits roadbumps, Apple and ARM and nVidia buying ARM), ARM will cease to be independent when the ARM acquisition by nVidia goes through. ARM will be under control of nVidia.

    That means that all ARM architecture licensees will feel the heat from nVidia. Why does it matter? ARM architecture licensees are multi-billion- and even multi-trillion-dollar companies:

    • Apple
    • Qualcomm
    • Broadcom
    • Samsung
    • NXP etc

    All of these companies will have to decide if they want to stick with nVidia/ARM or switch over to RISC-V.

    RISC-V is an open Instruction Set Architecture and can be implemented by anyone. If need be, anyone can buy RISC-V processor designs from SiFive and in the future from Intel - or from us. We have been using RISC-V since about 2012 and have experience with it.

    We are the only company that has developed extensions to RISC-V for performance, security and scalability, and we have added hardware support for virtualization.

    That makes us the only company that can get RISC-V into the server and internet backend.

    Hardware Vulnerabilities

    Posted on by: Axel Kloth

    Software design is not quite as robust and structured as hardware design. Nevertheless, even in ASIC and processor design bugs sometimes slip through the cracks, and they create problems that can be exploited. We have seen that with Spectre and Meltdown, both of which exploit an issue with out-of-order execution in conjunction with caching strategies. While Intel and AMD have issued patches, the problem is not fully solved, and users report performance degradation after applying the patches. Other hardware-related vulnerabilities such as Rowhammer and Half-Double make use of knowledge of the physical characteristics of DRAM, and I am fairly certain similar exploits can be devised for Flash. These exploits rely on side effects of mechanisms that were intended to improve processor and memory performance, and they were not foreseeable by the designers of the processors and memories. Nevertheless, they are devastating in their effect. I am certain that security will be one of the more important design considerations for the next generation of processors and memories. We have taken those vulnerabilities into account and have made sure that our processors and intelligent memory subsystems do not exhibit them.

    RISC-V in HPC

    Posted on by: Axel Kloth

    SiFive claims that it has taped out a 5 nm TSMC-produced HPC and AI capable processor in its press release here: SiFive RISC-V Proven in 5nm Silicon. In a lot of ways, that is great news as it proves that RISC-V is fully capable of supporting HPC. However, as I have said many times before, the ISA does not matter. Being successful at running HPC workloads comes down to how accelerators are included, and how memory is attached. My prior blog post ISA versus System Architecture points out what the fallacies are. We believe that there is much more to HPC than just a CPU made on a 5 nm TSMC process.

    Communication, metadata and endpoint security

    Posted on by: Axel Kloth

    Vice reports that Hacking Startup 'Azimuth Security' Unlocked the San Bernardino iPhone. That's not too surprising, as the prior suspect did not seem to have the wherewithal to do so. According to the article, "Motherboard can confirm a Washington Post report that said Azimuth Security developed the tool used on the San Bernardino iPhone".

    Let me quote some more from the article because I think that particularly the last paragraph is of importance: "Shortly after the FBI successfully accessed the phone, rumours circulated, originating with a single Israeli press report, that established phone-cracking company Cellebrite was behind the hack. Those reports were unsubstantiated, though. After unlocking the device, the FBI found no previously unknown message data or contacts."

    Why is that important? The investigators had all of the information they ever needed without cracking the phone or without having access to the encrypted communication (and neither did they need a secret backdoor). All they needed was the metadata - and I had stated that many times. There is no reason whatsoever to try to outlaw strong encryption. See my older post on this here: DOJ on Encryption.

    Yet another supply chain attack and breach

    Posted on by: Axel Kloth

    Gizmodo reports that U.S. Federal Investigators Are Reportedly Looking Into Codecov Security Breach, Undetected for Months.

    This is another supply chain attack, similar to the one used in what is called the SolarWinds breach. In this case, the attackers appear to have been able to gain access to a system that allows users to upload software to be tested onto a test server, and while that does not sound dangerous, the attackers likely were able to either extract user credentials directly, or added a backdoor that would send them the credentials of anyone who logged in.

    If that is the case, then the number of affected users is greater than the 29000 users mentioned. This is in fact dangerous because the impact balloons. It might not seem obvious, but as with the SolarWinds hack, the direct damage is bad, but the indirect damage is worse. Let's say that a Microsoft super admin's credentials were obtained in the SolarWinds hack, and the admin changed credentials. He or she wanted to make sure that not only new credentials were used, but that the software was updated as well - including MS Exchange and its cloud equivalent. So far the admin did everything right - change credentials and make sure that known vulnerabilities in the software are fixed. Now this admin submits the new code with the new credentials to be tested, only to find out later that someone stole his or her new credentials during the upload for 3rd party verification that the bugs were fixed... This is about as bad as it gets.

    For the SolarWinds attack itself, there are plenty of updates and news, such as NPR summarizing it and updating it with some new info, and of course the Biden administration imposing sanctions on the perpetrators of the hack in the following Bloomberg article Biden Sanctions Russia, Restricts Buying New Debt After Hacking, while at the same time trying to de-escalate the situation in Biden calls for de-escalation with Russia following sanctions, proposes meeting with Putin.

    Neocortex Supercomputer

    Posted on by: Axel Kloth

    It looks like there is some new life in semiconductor companies and in funding for them. It was about time as I fundamentally disagree with the notion I have heard at least one too many times that we have invented all that there is to invent. Current CPUs and all GPGPUs combined do not solve many computationally intensive problems effectively and efficiently. Certainly Cerebras made a splash here, and that is a great sign. The Next Platform reports that national labs are working on the Neocortex Supercomputer to Put Cerebras CS-1 to the Test.

    I completely agree that novel solutions to novel computational problems must be found, designed and funded. AI and ML training certainly qualify, but there are plenty of other unsolved problems that do not require wafer-scale compute. While Cerebras and we have different approaches to different problems, I am encouraged to see that funding and the willingness to try out new solutions are available again.

    For way too long have analysts and VCs focused on CPUs and GPUs only. It seems that CPUs have settled on x86-64, ARM and RISC-V, and within the GPGPU category nVidia is leading the pack due to CUDA, but DSPs and vision processors for industrial and automotive real-time control applications, security processors and math processors as well as large-scale integer-only database processors are needed to fill the gaps that CPUs and GPGPUs cannot fill.

    ARM v9, a new ARM processor architecture

    Posted on by: Axel Kloth

    ARM has released a new version of its processors and instruction sets. They hope that with this processor and ISA they can compete better and more effectively with Intel, AMD and all others in the Internet backend and in the data center. Whether that ISA or processor actually can achieve what some publications claim remains to be seen, as we can read here on CNET, stating Arm's new chip architecture boosts security, speed for billions of processors or here at Bloomberg with the headline Arm Takes Aim at Intel Chips in Biggest Tech Overhaul in Decade.

    As far as I can tell from the announcements and the available technical literature, ARM does what Intel and AMD have been doing, and there is really no difference in the approach: improving the number of instructions executed per cycle and adding instructions to the ISA. Both measures increase the complexity of the cores, add to the size and power consumption, and require ever-more sophisticated caching strategies along with a required increase in cache size. I wonder if that approach is the right one to take. Particularly in light of more processor and OS modes, the attack surface will only grow with an increase in the permitted instructions of the ISA.

    The undead are returning once again

    Posted on by: Axel Kloth

    It looks like the IT world has to deal again with the longest-living undead that I can think of, and that is the entire UNIX copyright saga around AT&T and SCO/Novell and now Xinuos - what a creative name. An anagram of UNIX appended by the abbreviation of Operating System as a company name. Clever. Not so clever is the fact that the Fear, Uncertainty and Doubt (FUD) has re-emerged, and threats of lawsuits against UNIX users are back. ZDNet has a great overview of this story here: SCO Linux FUD returns from the dead.

    The Register has an equally good overview of the background and some of its implications here: IBM, Red Hat face copyright, antitrust lawsuit from SCO Group successor Xinuos. I wonder when this is finally settled in court or through the passage of time. I am getting pretty sick of this kind of use of the legal system. Certainly patents and other IP should be protected, and the prevailing party should have the rights to the proceeds, but patent trolls and re-litigating cases that have already been decided or settled only clog up the judicial system for the legitimate cases.

    A great overview of DMA

    Posted on by: Axel Kloth

    Well, now here is a blast from the past, and it is even incredibly well-written: DMA. That is one of the many battles most processor and SoC designers face. There are two camps, one usually consisting of the software designers. They want to do everything in software, from polling instead of IRQs to memcopy and transferring data to and from peripherals. The hardware camp is usually of the opinion that everything should be done in hardware, so they advocate for IRQs and all data transfer for I/O executed by a DMA Controller (or its more advanced cousin, the IOMMU). After seeing that IRQs and traps and exception handlers are all now treated the same, I am firmly in the camp of hardware support. Therefore, the article came at just the right time. It takes less energy to have I/O (and even memory-to-memory) transfers executed by a DMA Controller. It is faster, it can be made more secure, and it is more reliable. It has very few, if any, downsides. The complexity of a multi-channel DMA Controller or even an IOMMU is so low that including it will not substantially increase the die size of a processor or SoC or even a microcontroller.
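
    For readers who have not worked with one, the contract between software and a DMA Controller is small. The descriptor layout below is a generic illustration (register names and fields vary per device, and dma_kick() is a placeholder): the CPU fills in the descriptor, kicks off the transfer and goes back to useful work until the completion IRQ arrives.

```c
#include <stdint.h>

/* Generic, illustrative DMA descriptor - real controllers define their own layout. */
struct dma_descriptor {
    uint64_t src_addr;      /* physical/bus source address          */
    uint64_t dst_addr;      /* physical/bus destination address     */
    uint32_t length;        /* transfer length in bytes             */
    uint32_t flags;         /* e.g. IRQ-on-completion, chaining     */
    uint64_t next;          /* pointer to the next descriptor, or 0 */
};

#define DMA_FLAG_IRQ_ON_DONE  (1u << 0)

/* Hypothetical controller interface: hand the descriptor to the hardware and
 * return immediately; the controller raises an IRQ when it is done. */
void dma_kick(volatile struct dma_descriptor *desc);

void start_copy(volatile struct dma_descriptor *desc,
                uint64_t src, uint64_t dst, uint32_t len)
{
    desc->src_addr = src;
    desc->dst_addr = dst;
    desc->length   = len;
    desc->flags    = DMA_FLAG_IRQ_ON_DONE;
    desc->next     = 0;
    dma_kick(desc);         /* the CPU is free to do other work from here on */
}
```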

    We have been designing and using DMA Controllers and IOMMUs for a long time as there is simply no substitute for simplicity, performance, security and power savings while at the same time allowing concurrent operation between the DMA Controller/IOMMU and the CPU.

    Cybercrime

    Posted on by: Axel Kloth

    As chip designers and as IT admins, as users and as ordinary citizens, we have probably all intuitively sensed that cybercrime is on an upward trajectory. The US CISA has statistics on the number of cybercrimes committed, and even looking at the number of CVEs filed on MITRE.org or the overview compiled at SANS.org, it is pretty clear that cybercrime is not going away, but instead growing. CNET compiled the numbers, and they are staggering: Cybercrime in the US jumped by 55% in the past two years. The annual loss reached $4.2 billion in 2020.

    In other words, cybercrime has reached a magnitude that requires all of us to take it seriously, and more importantly, will require us to take action. Security - and particularly cybersecurity - is not a spectator sport. We all have to do something proactively to protect ourselves. Using a secure browser on a secure operating system, with passwords that are long enough and are not reused across many sites, is a good start. Then contemplate using a hardware firewall - maybe there is one already built into your wireless access point or your cable or DSL modem. Learn how to configure and use them. Practice some digital hygiene. Not everything has to be posted on Facebook or Instagram or whatever other social media there is.

    North Korean Hackers at it again

    Posted on by: Axel Kloth

    ArsTechnica reports that hackers that were exposed by Google and Microsoft have reacted with classical counter-warfare by targeting those that exposed them. ArsTechnica states that North Korean hackers return, target infosec researchers in new operation. That is not too surprising, but in essence this is quite an escalation of severity. However, not only did they escalate, but they also set up a fake web site purporting to be a security company with a keen focus on research for digital security.

    CNN reports that Russian hackers did the same to those that exposed them in connection with the SolarWinds breach. According to CNN, Hunting the hunters: How Russian hackers targeted US cyber first responders in SolarWinds breach, Russian hackers directly targeted these researchers. Again, a very classic counter-warfare operation.

    That means that even more so than before, everyone - including security researchers - has to be very careful when sharing zero-day attacks, vulnerabilities, novel exploits or any kind of security holes, even in procedures and processes. At this point in time it seems as if the Western democracies do not fight back hard enough. While the US has offensive cyber warfare capabilities, it seems like the US is more on the receiving end of cyber warfare. Unfortunately, there is no global legal framework in place that is agreed upon and can be used to determine who the perpetrator was, and what the penalties are. On top of that, if the cyber warfare program is a state-run effort, it is hard to believe that the states that conducted these attacks will agree to any kind of extradition in any treaties.

    Firmware attacks are on the rise

    Posted on by: Axel Kloth

    I have been warning for a while that firmware and the boot process are not as secure as they could have been made. Firmware updates can be executed unauthenticated, and with physical access to a device, modification of the firmware is borderline trivial. All of that poses a danger, and I have mentioned it multiple times. In each of my blog posts about Pre-Boot Security, Security in the News, Newsmax rehashing debunked stories and Resilient Secure Boot, I have pointed out that firmware attacks are the most malicious ones as they cannot be detected from a running OS or by any malware scanner.

    On top of that, firmware controls the basic behavior of the system itself and of all of its embedded components. As far as I know, there is no malware scanner for the embedded controller in hard disks, SSDs, RAID Controllers, SAS or SATA Controllers, or LAN network interface cards, smart NICs and wireless LAN (WLAN/WiFi) Controllers - and the keyboard, the mouse and touchpad/trackpad. Even the GPU has its own firmware, and I have yet to see a scanner for malware on GPUs. As a result, I completely agree with Microsoft's assessment that Firmware attacks are on the rise and you aren't worrying about them enough.

    Micron giving up on 3D XP

    Posted on by: Axel Kloth

    There are more reports on 3D XP, and more in-depth information becomes available on why Micron decided to halt development of 3D XPoint and sell its Flash Memory Fab. Apparently, there was not enough demand to fully load even one fab. That is not entirely consistent with a prior statement saying that Micron was unable to scale out production. The only way both statements can be true at the same time is if scaling out production was so expensive or technically challenging that Micron gave Intel an upper limit of what could be produced for its Optane memory, and even if Intel absorbed all of it, it was not enough to fully utilize the Utah fab for this product. That would be a truly sad ending for a once promising technology. In my earlier blog post about 3D XPoint Memory Discontinued I assumed that Micron was in fact unable to scale out production of 3D XP and as a result was unable to make this technology self-sustaining. The lack of demand surprised me.

    In essence, it means that everyone needs to go back to the drawing board for a successor to Flash. While Flash has improved drastically, the fact that we need to deal with wear is somewhat problematic, but the biggest downfall is the write performance. In essence, the industry is still stuck with the growing performance discrepancy between processors, DRAM and Flash. I had hoped that 3D XP, or more generally Phase Change Memory, would have taken over from Flash as the leading memory type for density. Had that been the case, we could have expected DRAM manufacturers to focus more on performance instead of density and to create a lower-latency DRAM main memory. Unfortunately, that did not happen.

    Pre-Boot Security Gets More Secure

    Posted on by: Axel Kloth

    Booting up a computer has not changed in over 30 years. While the TPM and a few other improvements were invented along the way, essentially any computer today boots the same way a 30 or 40 year old machine did. The only fundamental difference is that way back then, computers booted out of a ROM, an EPROM or an EEPROM, which was essentially unchangeable via software. While that made it impossible for attackers to insert malicious code into the boot (or pre-boot) environment, it also prevented easy updates to the firmware to fix bugs. Over time, the pre-boot environment became more important as the boot code itself grew in size. With more and more of the firmware migrating to EFI/UEFI, the pre-boot environment has taken over what used to be the initial startup, and it is unprotected. I had designed a novel pre-boot environment with a primary focus on security and resilience, and I am glad to see that this is seen as a pressing issue today. Proof of that is that an increasing number of companies focus on pre-boot security, such as Lattice Semiconductor. While we are doing our part, we need the industry to come up with solutions that work universally, and that can be standardized and evaluated for their attained level of security and resilience.

    Once that is achieved, the industry will need to develop more stringent tests for security and resilience in pre-boot and boot environments, and that will have to include both hardware and firmware to withstand and log all evidence of attacks.

    Attack vectors will have to include those that assume physical access to or possession of the device under attack. After all, with intelligence pushed out to the edge like for 5G, we will see many more devices at the edge in physically not very secure enclosures, and that will make these devices much more vulnerable to snooping, denial of service attacks and theft with the intention to steal keys.

    3D XPoint Memory Discontinued

    Posted on by: Axel Kloth

    Years ago, Intel and Micron had set out to develop a new kind of memory. It had to be faster than Flash and denser than DRAM. It was also supposed to not show wear and degradation. They called it 3D XPoint ("CrossPoint"), or 3D XP for short. It was Phase Change Memory, but if you called it that, Intel got upset. I thought that the approach was genuinely good, novel and worthwhile, but Intel lost interest a while ago, so Micron took it all over. Unfortunately, Micron could not create a large-scale, production-worthy version of it, and as a result, it was canceled altogether. That's really a loss as I think that it could have been the missing link between Flash and DRAM. While I had assumed for a while that it would not see commercial use, the announcement at The Next Platform came as a surprise, and it now necessitates that the industry rethink what the new memory type between DRAM and Flash should be with regard to density, latency and bandwidth.

    I would have liked to see some new form of memory in that gap, but it seems that we may have to revert to what Abacus Semi is doing and what its HRAM will show in terms of performance, density and of course the price point.

    SolarWinds breach detection tool

    Posted on by: Axel Kloth

    You know that things are bad when CISA releases a new SolarWinds malicious activity detection tool. While it is great that the community now has one more tool in its toolchest to detect if a breach occurred, it means that most CIOs and CISOs don't even know if they have been breached, and if so, to which degree data was exfiltrated. That is a complete and utter disaster.

    The more we learn about the SolarWinds breach, the more it becomes clear that its impact has been vastly underestimated. I think it may have a lasting effect for all software vendors, and that includes open source and closed source. As far as I can tell, it might also undermine trust in cloud services and online tools such as Office365 and similar or equivalent solutions.

    Our unique selling point

    Posted on by: Axel Kloth

    Unleashing the true potential of massively parallel compute solutions is a system architecture decision, not a result of using better process nodes. We do not have to rely on Moore's Law.

    We are not banking on Moore's Law alone to improve computer and processor performance. Intel, AMD and all ARM licensees can do that just fine, and that is not our major point of criticism of the status quo. Any processor designer can probably improve on the performance of any given processor - with or without Moore's Law. What we are saying is that ganging up many processors does not work well enough to provide linear scaling of performance across a broad range of computational problems. In other words, 10,000 cores in a Linux cluster will not solve a computational problem 10,000 times faster than one core. Why is that? Very simply, processor design has been too successful over the past 20 years. The discrepancy between processors, accelerators and memory has only grown in that time. That now demands a fairly coarse granularity of tasks, and that is counterproductive. What do we mean by that? If a processor core works on a task that would take it 10,000 CPU cycles to complete, and we have 100 cores available, then distributing the task onto the 100 cores should allow all of the cores to complete the entire task in 100 cycles as we have parallelized the problem. So ideally, we have 100 times the core count and 1/100th of the execution time. That would be ideal linear scaling. The problem is that it does not work that way.

    First of all, we face the problem that we need to chop the big problem into 100 chunks. Then we need to verify that there are no dependencies among the chunks. The first part will likely take 100 cycles all by itself, on one core. The second part can be parallelized, but it adds to the total complexity of the problem. So let's say we ignore the verification of interdependencies and let the cores figure it out once they work on the problem. We still have to distribute the chunks. That will take a fixed but fairly large amount of time as we have to share the data and instructions across cores in a cluster within a CPU, then across CPUs, and then across computers to other CPUs and cores. This will consume tens to thousands of cycles, and we have not worked on the parallelized problem yet - only on chopping up the task and distributing it. Once everything is fully distributed, the cores work on the problem for about 100 of their cycles, and then send the results back. Now the initial scheduling core will take in the results and consolidate them. It is in the realm of possibilities that all of this overhead will cause the result to be ready after about 10,000 cycles on the CPU core that tried to farm out the problem to accelerate processing. In other words, we have not saved any time. Instead, we have kept many more cores busy and used the CPU-internal resources, the network and a whole lot of other infrastructure without gaining anything. The reason for this is simply that cores process data much faster today than they did 20 years ago, but memory has not sped up to the same degree, and while network bandwidth increased dramatically, the latency (the delay) has become worse. As a result, it does not make sense to farm out small tasks of 10,000 or 100,000 cycles. The granularity of tasks that can and should be farmed out has gone up drastically, and as a result, it has become very coarse. Fine granularity only makes sense across adjacent cores like the ones in a ccNUMA system, or across cores in a coherency domain.
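    To make the cycle arithmetic above a bit more tangible, here is a minimal back-of-the-envelope model in Python. All numbers are the hypothetical ones from this post plus made-up per-chunk overheads, not measurements of any real system.

    ```python
    # Back-of-the-envelope model of farming a 10,000-cycle task out to 100 cores.
    # All cycle counts are illustrative and mirror the hypothetical example in the text.
    def parallel_cycles(total_work=10_000, cores=100,
                        split_cost=100,               # chopping the problem into chunks (serial)
                        distribute_cost_per_chunk=50, # sending data and instructions to a core
                        collect_cost_per_chunk=50):   # gathering and consolidating the results
        overhead = split_cost + cores * (distribute_cost_per_chunk + collect_cost_per_chunk)
        compute = total_work / cores                  # ideal compute time per core
        return overhead + compute

    ideal = 10_000 / 100                              # 100 cycles if scaling were perfectly linear
    actual = parallel_cycles()
    print(f"ideal: {ideal:.0f} cycles, with overhead: {actual:.0f} cycles")
    # With these made-up overheads the "parallel" run takes about 10,200 cycles,
    # i.e. no faster than running the whole task on the single original core.
    ```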

    What needs to happen is that core processing performance, inter-processor communication and memory bandwidth have to be re-balanced to allow for reasonably fine-grained granularity of tasks to be parallelized. That is what we are doing, and that is where our unique selling point lies. If your computational problem can be solved in one core or one processor, we will not be the solution for you. If your computational problem is big enough to keep millions of cores busy, we are the right partner.

    A real-world Spectre exploit

    Posted on by: Axel Kloth

    It was only a matter of time. Spectre and Meltdown are vulnerabilities that so far had only shown the risk of theoretical exploits, but none in the wild. That has now changed. The Record claims that the First Fully Weaponized Spectre Exploit Was Discovered Online. In essence, that means that a non-fixable hardware feature has been exploited by real-life malware. With that attack now being a blueprint for others, more malicious attacks will be carried out against processors for which no hardware fix is possible, and the firmware fix not only reduces performance, it also does not guarantee that the system is no longer vulnerable.

    That is about the worst case scenario imaginable. Protecting against an entire class of attacks is going to be necessary, but how that is going to be done is unclear to me. I am not sure if firewall rules can even be written to detect such attacks.

    SIMD and why I don't like it

    Posted on by: Axel Kloth

    I have criticized SIMD ("Single Instruction, Multiple Data") architectures for quite a while, and for one reason or another I got a lot of questions and feedback on my take on it in the past few weeks. I'll try an analogy to point out what SIMD is, why it is not always useful, and why MIMD ("Multiple Instruction, Multiple Data") is not a whole lot better. EPIC (Explicitly Parallel Instruction Computing) is dead, and approaches with VLIW (Very Long Instruction Word) and extremely wide instruction words and decoders/predecoders don't seem to hold much promise either.

    Imagine you have a very long wall that needs to be painted. You have 100 painters available to speed up finishing the job. A foreman will go and create 100 sections of identical size, and then direct each of the 100 painters into a certain position. Upon job start, each of the 100 painters will do exactly the same thing - down to the movement of the paint roller. If the wall were flawless and the surface required exactly the same amount of paint for proper coverage all along the wall, the result would be a perfectly painted wall that was finished 100 times faster than if a single painter had done it.

    If the wall surface is not perfect, it is usually not a big problem for the foreman or a few painters to fix the remaining issues, and if the wall had a few doors and windows and trim, then these doors and the trim can be repainted with the proper paint after the wall paint has cured. The windows have to be treated such that the paint is scraped off as all painters will have painted over the windows as well.

    The more windows there are, and the more the wall surface is imperfect, the more repair work will have to be carried out, and the advantage of using only one foreman and 100 painters doing exactly the same thing goes away.
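    For readers who prefer code over paint: in a SIMD (or GPU warp) model, a data-dependent branch is typically handled by computing both code paths for every element and selecting afterwards, so the divergent work is simply thrown away. A minimal NumPy sketch with made-up wall data:

    ```python
    import numpy as np

    # SIMD-style processing of a "wall": every lane executes the same instructions.
    # A data-dependent branch (window vs. plain wall) is handled by computing BOTH
    # code paths for all elements and selecting afterwards - the unused work is wasted.
    wall = np.random.rand(1_000_000)       # made-up surface data, one value per segment
    is_window = wall > 0.95                # roughly 5% of the segments are "windows"

    plain_pass = wall * 2.0 + 1.0          # work intended for ordinary wall segments
    window_pass = np.sqrt(wall) - 0.5      # different work intended for window segments

    # Both arrays above were computed in full, even though each segment needs only one of them.
    # The larger the fraction of "windows", the more of the 100x advantage evaporates.
    result = np.where(is_window, window_pass, plain_pass)
    ```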

    A seemingly simple solution would be to assign one foreman to each painter so that the foreman can tell the painter at any given time what to do. In actuality, that's not a solution: we might still retain the time advantage of 100 painters and 100 foremen finishing the job 100 times faster than a single painter, but we have now deployed 200 people, and there needs to be a clearinghouse or arbitrator among the large number of foremen, so that will add to the total number of people on the job. Just the job of distributing tasks and jobs is now an administrative challenge by itself, so we have created a hierarchy of non-productive entities (the foremen that don't paint) that comes with increased cost and complexity, but we have not fundamentally solved the problem. In essence, that's MIMD. Instead of the task complexity itself we now have a challenge in the task distribution complexity.

    What we do is different. We are not using SIMD or MIMD. It is not EPIC either, nor does it rely on an extremely wide instruction decoder either. We cannot disclose what it is exactly, but it does not suffer from the SIMD or MIMD problems, nor does it stall when dependencies are discovered which is what bogs down VLIW architectures.

    New observations & questions

    Posted on by: Axel Kloth

    I think it is time to add one more set of observations to my older ones, which can be found here and here. I'll call this new set of observations Kloth's Third Observation.

    • 1. Most computational problems today are large enough to not be solvable in a single core or processor, assuming reasonable execution runtimes are required.
    • 2. As a result, programmers must be enabled to spread out the computational load over many cores easily.
    • 3. Ideally, the performance of a many-core system or massively parallel system should grow with the number of cores linearly.
    • 4. On-chip communication is lower latency, higher bandwidth and energetically more efficient than off-chip communication.
    • 5. Certain types of computational problems are not well solvable in a general-purpose processor as they either cause long runtimes or high power consumption or both, and as a result, coprocessors or accelerators for these kinds of computational problems are required.
    • 6. Coprocessors as accelerators solve specific problems and are faster and more energy-efficient than general-purpose processors and thus should be deployed to support general-purpose processors.
    • 7. #1 through #6 require that we put as many small cores as we can onto a single chip, and allow for massive I/O bandwidth to other processors, accelerators and memory.
    • 8. To achieve #7, we should build the largest processor and coprocessor feasible that we can fit onto a die economically.
    • 9. Ideally, processors and accelerators should have the same pinout to be interchangeable.
    • 10. They should use the same I/O to be able to communicate with each other and with memory.
    • 11. All coprocessors should be easily usable by programmers through the use of APIs (“libraries”) to prevent the need to re-invent the wheel many times over.

    With these observations in mind, we can now start to think about what we need to do to solve those problems.

    Security in the News

    Posted on by: Axel Kloth

    Computer and network security finally seem to get the attention they deserve. The past few years have been absolutely terrible when it comes to protecting everyone's personal and private data. Equifax was not the first breach, and the SolarWinds hack certainly won't be the last. The issue I see is that the approach to solving the problem does not seem to fit the underlying cause. The solution that is pushed by the industry is more software. If software were able to solve the problem, we would have solved it. However, what we observe is the opposite: more software equals a vastly larger attack surface. Therefore, more and better hardware is needed so that the attack surface is reduced, and then better firmware and better software can be built on top of it.

    Let's first get to the mentions of the solution providers in The 20 Coolest Endpoint And Managed Security Companies Of 2021: The Security 100. Again, I am glad security is finally being brought up, and that is a good thing. If you look at the list, it is all higher-level software, without any regard to the underlying Operating System, the firmware or the hardware. The good thing is that there is backup and recovery software in case the breach did succeed...

    The next article is about support software that helps prevent permanent damage in case a company got breached. Prevention of breaches by using encryption, virtualization, logging, filtering and advanced analytics as well as heuristics will reduce the attack surface and to some degree the severity of a breach. These applications are all useful and needed, and they certainly have their place, but by themselves they won't solve the breach problem in the first place. The 20 Coolest Network Security Companies Of 2021: The Security 100 is a good read and the list is well-researched; however, I still think it lacks pointers to the real solution - better underlying hardware that makes the firmware more robust and needs less software to protect itself, therefore reducing the attack surface of the whole solution. This must start with a secure BMC for servers, secure processor cores and non-core components, hardware filtering, truly secure boot and firmware update hardware measures, and of course accelerators integrated into the CPU to facilitate these requirements. Only when that is accomplished can we try to harden the software (and that includes the Operating Systems as well as all APIs and the application software stack) against attacks.

    Newsmax rehashing debunked stories

    Posted on by: Axel Kloth

    Newsmax is rehashing two stories that were debunked long ago, claiming that China Used Secret Microchip to Spy on US Computers. The first one is an old Bloomberg story that claims that China (all 1.3 billion of them?) installed secret microchips on SuperMicro server mainboards so that they can control them. The second story is about an alleged Pentagon attack in which China had infiltrated an unclassified part of a Pentagon network.

    I had commented on that Bloomberg story on Youtube while I was CTO of Axiado. First of all, server mainboards are designed with modern CAD software that creates a BOM (Bill of Materials), and second, the pick-and-place machines used in mainboard assembly use that BOM and the coordinates from the CAD software to build the boards, and then quality control (QC) will take pictures of each mainboard and compare the actually built boards with the "golden model" of what they should look like from the CAD design files. Any extraneous chips would immediately be noticed. It is highly unlikely that a company as sophisticated as SuperMicro would not notice an extraneous chip on their mainboards. It is even more unlikely that first SuperMicro misses them, and then users such as Apple, Amazon and others, including us, would not notice any chips they don't recognize on server boards. Another claim in this story is that the extraneous chip would be able to control and monitor all of the traffic inside the server. Well, if that is the case then this little magic thing must have more computational horsepower than all of the Intel or AMD or ARM cores that are legitimately present in the server - and all of that without its own memory! If someone is advanced enough to build such a chip, then they would not waste their talent on an attack as indicated above.

    What is possible is that an attacker installed malicious firmware on the server mainboard, as the Baseboard Management Controller (BMC) - in essence a Linux/ARM-based PC inside a PC-based server - is not a very secure chip. However, that is firmware, not a chip. Huge difference... The outcome may be similar, as putting a backdoor in or monitoring some of the management traffic is certainly possible, but it is for certain not an extraneous chip.

    The other story, that an unclassified Pentagon network was breached and unusual behavior of SuperMicro servers was observed, has not been reported as true anywhere. This is a fairly wild claim that cannot be substantiated. Penetrating even an unclassified Pentagon network is not trivial. Assuming that compromised servers were installed and then created unusual traffic is much more likely, albeit I doubt that the Pentagon allows servers that were not subject to configuration management (and that includes firmware versions) to be installed in any live network. I doubt the validity of that story as well. All Pentagon networks will have firewalls on each entry and exit point that check for infiltration and exfiltration, so if the attackers hoped that classified data would be exfiltrated from a breached non-classified network, then this was a test to check the Pentagon's response to an exfiltration threat.

    I am not trivializing the threat, but there are simply too many holes in these stories. A sophisticated attacker would use stealth methods - not an extraneous chip, and not waking up the 800-pound gorilla with an unnecessary test and the ensuing incident response. On top of that I have to say that the attacks out of China have been a whole lot less advanced than those we saw from Russia.

    FTC opens probe into nVidia and ARM merger

    Posted on by: Axel Kloth

    It looks as if, about four months after the nVidia/ARM merger announcement, some of the larger ARM technology licensees are finally waking up, and the FTC is taking a cue from Britain's Competition and Markets Authority that this merger or acquisition is probably not going to be without an impact on licensees or end customers.

    The FTC is now taking a more detailed look into the nVidia/ARM merger, which in essence is not a merger but a flat-out acquisition of ARM. My concern was and is that a FRAND type licensing agreement is not a very likely outcome of that acquisition, and that it would harm all current ARM licensees. The question here is not if the current licensees could do what Apple did with its ARM-architecture application processors. That is the easier part as virtualizing them and then gradually replacing them is not too much of an issue as Apple has shown multiple times now.

    The much bigger problem is all the small and oftentimes ignored embedded ARM cores in peripherals and in most of the I/O, including WLAN and USB and other components. They are designed and built to run on as little memory as possible and therefore cannot be virtualized. As a result, nVidia knows it has more than a bargaining chip. It can exert undue financial pressure on licensees that simply cannot replace the embedded ARM processor cores in time to avert an increase in licensing fees. I would not be surprised if any modern phone contains 30 to 40 embedded ARM cores - removing and replacing them is a nightmare, not only for the hardware development teams but in particular for the firmware teams, as these cores likely run a very broad variety of operating systems or even bare-metal application code.

    I assume that Google/Alphabet, Microsoft, Qualcomm and others have come to the same conclusion, according to Bloomberg and CNBC. Looking at the licensing terms and conditions for CUDA - nVidia's signature piece of software that makes GPGPUs hum - could have provided regulators and licensees a clue towards what nVidia is up to. I am amazed that it took them so long to find and voice their objections.

    Let's see how this pans out. I am sure that the last word has not been spoken in this saga.

    nVidia and ARM merger hits roadbumps

    Posted on by: Axel Kloth

    I had voiced my concerns on the nVidia/ARM tie-up on 2020-10-12 (nVidia buying ARM) and on 2020-10-14 about Apple and ARM on my blog.

    It looks like Qualcomm is finally deciding that maybe I was right, and is now objecting to the acquisition. That's a bit late, and I assume that it might not have an impact on the regulators, but the big wildcard in the game is China anyway. If China objects or obstructs the deal, then it does not make a difference whether Qualcomm objected.

    My argument was and still is that ARM is a licensing company for a widely used processor architecture and as such has to be independent of any of its licensees. An independent entity can make sure that there are no favorites, no undue influence, and that licensing can follow a FRAND (fair, reasonable and non-discriminatory) model. This would not be possible under nVidia ownership - and neither would it be under the ownership of Apple, Samsung, Qualcomm, Broadcom or any of the other big names. That is not to say that these are bad or unethical companies - the opposite is true. These are all good companies, but they are driven by (relatively) short-term profit requirements as they are all public companies, and as such have to have a leg up over their competitors. That in turn means that the primary driver is an increase in licensing revenue, and while that might destroy the ARM ecosystem in the long term, it will bring monetary benefits in the short term.

    I believe that ARM would do better if it remained independent. If that is not an option, then I suggest that all ARM licensees form a consortium with strong governing principles that continues the current licensing model, and make sure that this consortium is owned by all licensees that spend in excess of $10M annually on ARM licenses. That way license fees to some extent return to the owners while at the same time making sure that the ecosystem persists, and that new customers of the ARM architecture can be onboarded.

    I say this not because I am a fan of the ARM architecture. The opposite is true. I vastly prefer the open RISC-V ISA over ARM and x86-64 and other processor architectures, but for the sake of diversity I'd like to see ARM stick around.

    Resilient Secure Boot

    Posted on by: Axel Kloth

    I have been asked many times now why I do not think that the so-called secure boot is in fact secure. It is not exactly easy to understand why, even with some newly added security technology, current secure boot schemes do not guarantee full firmware boot image integrity. Once the bootloader has loaded all necessary components and has started loading the operating system from disk, we’d like to be certain that no malicious code has been installed or is even running below the level that the operating system (and therefore the malware scanners) can detect.

    Here is how “secure boot” technology works. The CPU will start executing code from a very specific memory location once RESET is de-asserted. This location is typically located within the so-called real mode address range that is only accessible prior to entering virtual mode, and it is associated with a physical address that is located at the top of a 20-bit or 32-bit addressable address space. An SPI Flash chip is normally mapped via memory-mapped I/O to this address. Since the processor does not know how to read from a device that is not main linear memory, it needs microcode ROM inside the processor to be able to access, read and execute code within the SPI Flash. This microcode cannot be altered and is created during the manufacturing and final test of the processor. It instructs the processor to read and execute the code within the SPI Flash, and as a first step, the processor “verifies” the integrity of the entire SPI Flash chip. It reads the entire contents, called the firmware “image”, computes a hash value, and compares that hash value to a value found at a specific location in the image itself, or to a value that is stored in the Trusted Platform Module (TPM).

    While this sounds like a secure solution, there is nothing that stops anyone from altering the verification code in the SPI Flash to simply skip the verification and declare the contents valid, complete and authentic. In other words, the verification is a bit like Münchhausen pulling himself out of the mud by his own hair. If you trust the verification that is run in software from an unverifiable source, then yes, this is secure. If, however, you assume that the verification code can be circumvented and entirely skipped by a programmer who “updates” your firmware image in the SPI Flash, then this scheme does not seem capable of guaranteeing that the image being booted has never been compromised.

    For the sake of the netizens' digital security on the Web I certainly hope that resilient secure boot becomes the industry standard. We call it Assured Firmware Integrity/AFI™ and Resilient Secure Boot/RSB™ because we believe that the term secure boot is overused and at this point in time useless. Most secure boot schemes are in fact not secure. We can assure the integrity of the boot code, and we can guarantee that the firmware has not been modified by an unauthorized entity or with unauthorized firmware images. At the same time, we can guarantee resilience as we can boot from a golden image in case an update failed or was corrupted.
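    To make the Münchhausen problem concrete, here is a minimal Python sketch of the kind of check described above. The image layout (a SHA-256 digest stored in the last 32 bytes) is an assumption for illustration; the point is not the hash math but where the checking code lives.

    ```python
    import hashlib

    # Minimal sketch of the firmware "verification" described above (illustrative layout:
    # the image is assumed to carry its expected SHA-256 digest in its last 32 bytes).
    def verify_image(image: bytes) -> bool:
        payload, stored_digest = image[:-32], image[-32:]
        return hashlib.sha256(payload).digest() == stored_digest

    # The weakness: verify_image() itself ships in the same mutable SPI Flash. Anyone who
    # can rewrite the flash can replace it with a function that always returns True, or
    # simply recompute and store a matching digest for the tampered payload. Only a
    # verifier rooted in immutable hardware, checking a cryptographic signature rather
    # than a self-declared hash, breaks that circle.
    ```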

    Pat Gelsinger (finally) in as CEO at Intel

    Posted on by: Axel Kloth

    Pat Gelsinger was the first CTO at Intel and the architect behind the i486, and he should have become CEO when Paul Otellini stepped down. Instead, a string of non-technical CEOs was installed, and essentially all of them failed. While Intel's BoD finally came to its senses, the question now is whether it is too late, and whether Pat can catch up after more than a decade away from chip and process node development.

    Let's hope he still can identify tech talent so that he can refresh the current second and third level of management. It is direly needed. Intel needs to catch up both in processor architecture and in process technology. That's a huge undertaking.

    Most of Intel's expensive acquisitions (more than $33B) under the non-technical CEOs have not returned any investment, and it is going to be up to Pat to reverse those mistakes, fire those that came on board without the required technical understanding, and restore competitiveness with AMD and ARM (and to some degree RISC-V) on the processor architecture side while at the same time battling TSMC for process node development. This is going to be an uphill battle, and it is made even more delicate as Intel needs TSMC to produce some of its processors. Drawing the lines between being a customer and competing with a supplier is going to be tough.

    I wish Pat all the best simply because if he is not successful, Intel will suffer, and semiconductor manufacturing in the US will vanish.

    Our pledge for actions

    Posted on by: Axel Kloth

    We pledge that we will not do business with anyone who has been actively involved in the attempted coup and insurrection in Washington, DC on 2021-01-06, including the organizers and enablers of President Trump, as well as those Senators and Representatives who objected to the certification of the votes of the Electoral College, nor with any company that has appointed any such person to a Board of Directors or into an advisory position. We will prepare a list of persons that can be obtained by emailing us. These people and organizations will be permanently banned from doing any business with us.

    For any job applicants we will require that they certify that they have not been part of the insurrection in Washington, DC on 2021-01-06.

    Maybe the Senators and Representatives that serve the public have forgotten what the Constitution looks like, what it is all about, and what legal repercussions it has if it is disobeyed. In case they don’t recall that there is a government printing office through which the US Constitution is freely available, here is a link to the US Constitution in a version from 1993. Another link to a different annotated and commented version with analysis of its legal impact and ramifications is here.

    We have changed our T&Cs

    Posted on by: Axel Kloth

    In light of the attempted coup by the sitting President of the United States of America to overthrow the freely and fairly elected Biden/Harris government, we have decided to update our terms of engagement and our code of conduct with prospective employees, customers and partners.

    We will update this section asap.

    Donald J. Trump - a clear and present danger

    Posted on by: Axel Kloth

    Donald J. Trump has proven to be the clear and present danger to the US, its Constitution and its people that many of us had feared. It is my civic duty to request that our elected officials act and remove him from office and impeach him so that he cannot start a second coup attempt, start a war, or bring upon us even more civic unrest than he has already incited. The insurrection he incited and asked for brought us to the brink of disaster, and we cannot afford to have him serve out his term as it simply is too dangerous to do so. He has no checks and balances in place and still holds the nuclear codes and the "Commander in Chief" designation.

    HPE and Oracle leaving Silicon Valley

    Posted on by: Axel Kloth

    Recent news indicates that Oracle and HPE have left Silicon Valley. That's not entirely true, as all these companies did was move their HQs to Texas. As far as I can tell, there were no layoffs or office closures within HPE or Oracle. I see that purely as a tax avoidance strategy. I do not foresee any impact on the workforce once the pandemic is over and employees return to the office for work. It is troubling in a way that clearly those companies did not see continued value in being headquartered in Silicon Valley, but I do not think that one should infer that Silicon Valley has lost its mojo. It is expensive here, and it always has been, but the pool of talent is deep and broad. While other cities (and regions) may have caught up in some areas, I still think that the entrepreneurial spirit and availability of funding sources combined with the depth and breadth of talent in the San Francisco Bay Area is unparalleled.

    The departure of Palantir is different though. While I am not a huge fan of Peter Thiel or Alex Karp I have to admit that there is some truth in why Thiel, Karp and Palantir all left Silicon Valley. To some degree Silicon Valley has become its own echo chamber, and there is a downside to that. If we agree that a democracy has to have the right to defend itself (i.e. the nation and its people) against adversaries, then we must be able and allowed to put that defense infrastructure in place, without a social or other stigma attached to those who work there. If lawmakers and justices agree that there is a threat from foreign and domestic groups that threaten to overthrow our way of governing ourselves, then of course we must have the means in place to conduct surveillance operations against those involved. That does not invalidate my opinion on strong encryption, as I have mentioned many times that surveillance of the endpoints and the existence of communication can be sufficient to establish a reasonable cause for a search warrant. Wholesale surveillance and archival of communication is not and should not be synonymous with this goal, and that is why I am split on Palantir. The NSA might have developed the technology anyway, but Palantir certainly accelerated the possibility of complete and total surveillance. Coming back to my echo chamber comments, I think that while it is great to have a noble goal (“Do no evil” comes to mind), sometimes that has unintended consequences. I’ll expand on this in a subsequent blog entry.

    It has become very expensive to live and conduct business in Silicon Valley, and there cannot be any doubt that this has started to drive away talent and a lot of the supporting populace that makes the Valley hum. It is questionable whether we get our money’s worth here if I look at the highways, the public transportation infrastructure, and even Internet access at a decent Quality of Service and cost. For unmanaged 1 Gigabit per second links without any decent QoS and availability (and robustness during disasters) the going rate is over $100 per month, and that is too high in international comparison. Managed full-duplex GbE is still in excess of $1000 per month, and that is vastly overpriced. It is ridiculous to see that in the middle of the region that brought us all of the ASICs and the devices that make communication available, this very access is not competitive with the rest of the world.

    Attack on the Internet infrastructure in 2020

    Posted on by: Axel Kloth

    I want to comment on the so-called SolarWinds hack as there is a lot of misinformation out there, and it is not due to the fact that reporters intentionally mislead, but due to the complexity of the case. I would like to simplify it as much as possible without making it false to explain what happened here.

    For one reason or another, penetrating a single server, a single data center or a single cloud provider was not enough for the attackers. They wanted full control over the IT infrastructure in their target countries.

    This was either impossible to do with traditional methods, or too time-consuming.

    To understand this attack and its scope, it is important to know that data centers have grown so huge that they require their own set of tools to just deploy servers, prepare them for customer (or in cloud speak, "tenant") use, administer them, check on their health status, and maintain them as well as to take them down if they need replacement. These OAM&P (Operation, Administration, Maintenance and Provisioning) tools - in this case Orion - are complex all by themselves, and they are usually built the same way that all other Internet backbone tools including Linux are built: with a collection of a very large number of pieces of source code, some of them proprietary and in-house, and others open source.

    That is where this attack originated. The attackers hijacked some of these pieces of software and added code to them that constituted a so-called "backdoor" once their code was included in a new "build" of the OAM&P software. In other words, once the new tool was built (compiled and linked), their code allowed them to use the backdoor to access the servers through the OAM&P mechanism. That is very hard to detect as the tool does what it was intended to do, but for people who were not supposed to access it. The only way to detect an attack like this is by using behavioral monitoring. In other words, if I use the tools to conduct OAM&P and that is my profile of use, and then someone with illegitimate credentials uses the same mechanism (IP addresses, port numbers, APIs and more) to conduct surveillance and to access net user data, then the profile is different and should (and can) be detected as illegitimate and be flagged as a breach.
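    As a toy illustration of behavioral monitoring, here is a short Python sketch in which the profile is simply the set of (source address, port, API endpoint) combinations seen during normal administrative use. A real product would use far richer statistical models; every name and address below is made up.

    ```python
    # Toy behavioral monitor for an OAM&P interface (illustrative only).
    class BehaviorMonitor:
        def __init__(self):
            self.known_profile = set()

        def learn(self, source_ip: str, port: int, api: str) -> None:
            """Record legitimate administrative behavior during a baselining period."""
            self.known_profile.add((source_ip, port, api))

        def is_expected(self, source_ip: str, port: int, api: str) -> bool:
            """True if the access matches the learned profile; False means flag it for review."""
            return (source_ip, port, api) in self.known_profile

    monitor = BehaviorMonitor()
    monitor.learn("10.0.0.5", 8443, "/api/v1/provision")                 # normal OAM&P use
    print(monitor.is_expected("10.0.0.5", 8443, "/api/v1/provision"))    # True  - fits the profile
    print(monitor.is_expected("203.0.113.7", 8443, "/api/v1/export"))    # False - deviates, flag it
    ```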

    However, it took SolarWinds and FireEye as well as Microsoft more than six months to detect this breach, and effectively end new infections (but likely not eradicate all existing backdoors). I rarely praise Microsoft, but in this case they did exceptional work once this breach was suspected. Microsoft then followed through with stopping the attack and helping affected users to preserve evidence as much as possible to submit this to law enforcement. FireEye - while itself being a victim - assisted Microsoft and others in evidence collection and analysis of the logs of victim machines.

    The challenge that all victim organizations are facing now is that it is unclear if the attackers managed to install backdoors in the servers themselves, not only in the OAM&P servers in the network management centers (NMCs). The number of servers in the NMCs is limited, and worst case they can be physically replaced. However, it is not clear if the breach went so far as to compromise the servers that deal with customer or tenant data. In other words, at this point in time we cannot trust in the integrity of the servers in the affected organizations - and those include the US Treasury and the administrators of the US stockpile of nuclear weapons.

    Only after a thorough review of all software on the servers that might have been affected can we be certain that no additional backdoors were installed. That is a monumental task as likely hundreds of thousands of servers must now be assumed to be compromised.

    The question to ask is what the attackers got out of the successful execution of this breach. First of all, they gained insight into lots of US agencies, and they have been able to steal data possibly indicating the identity of spies and counteroperations. Second, they have succeeded in eroding trust in US and, more generally, Western democratic institutions. Third, they have damaged the economy as some of the Internet infrastructure has to be replaced or at least evaluated for persisting breaches and backdoors. The damage is likely in the tens, if not hundreds, of billions of dollars.

    Considering that the targets are very clearly the Western democracies, I have to say that the attacks likely originated in Russia or in North Korea. While China certainly has its own cybersecurity and cyberattack units, this breach is very unlikely to come from China as its economic well-being is too intertwined with the economic status of the Western democracies.

    More on M1 and multi-threading

    Posted on by: Axel Kloth

    More and more benchmarks keep appearing for Apple's M1. The more I read about it, the more I am impressed by how far Apple pushed the ARM architecture, both in terms of absolute performance and with regards to its energy-efficiency. This is not only due to a manufacturing process at TSMC that is ahead of everyone else, it is also about how the entire processor is effectively a System-on-Chip with all needed accelerators fully integrated. What I am still missing is information about how many of the accelerators are in fact fully programmable and by themselves are ARM cores in disguise. I assume it's quite a few of them, and that may be a good portion of its advantage. Surely the CPU cores are impressive, with the little and big ones implemented such that the required performance is achieved by the cores that are most suited to the tasks. While they might still lack the multi-threaded performance that large Intel cores provide, the real advantage lies in the offload of mundane tasks, and I'd really like to see how much of the offload is done in dedicated cores so that the application cores don't have to take over those chores. I am pretty sure that in another two generations or so Apple will have caught up with Intel even in multi-threaded performance, while retaining its lead over Intel with the intelligent I/O. That would truly end Intel's dominance in compute. The only question now is what AMD does to hold up the x86-64 flag... I am not betting on them to keep it going. I think it is time for AMD to switch over to RISC-V and leave x86-64 behind.

    Performance, Benchmarks and Apple's M1

    Posted on by: Axel Kloth

    With regards to Apple's announcement of the new M1 we are back to discussing performance. Performance is measured by using standardized benchmarks. Even that is a very complex and multi-faceted issue for processors and even more so for computers. Why? Because the number of input parameters is extremely high. If we go back to basics and discuss basic physical units we want to measure, that's pretty simple: if we want to measure a physical unit, we can refer to the methods that NIST and BIPM and others prescribe. For a length, we take the speed of light and measure the time it takes to go from the starting point of the unit under test to the endpoint of it. Since we know the speed of light (it's one of the fundamental constants), the travel time gives us the exact length at a very high level of precision. The same is similarly applicable to areas and volumes, and even to currents and voltages. The definition of the kilogram has recently been changed from the physical Ur-Kilogramm prototype to one derived from a fundamental constant (the Planck constant).

    Measuring the performance of a processor or a computer is a vastly more complex undertaking.

    The perceived performance of a processor or computer is highly dependent on my application profile, and that is different from yours and anyone else's.

    I do not game. I do compile, but I do more floating-point math work. I don't use Word or any MS Office tool. I use LibreOffice, and I use a different browser with vastly higher capabilities for filtering and a different email client than most. So my use and therefore my requirements profile differ. Maybe you compile, do lookups, use a standard browser and FTP and other tools. Your use of the instructions that your computer (more precisely, the processor's ISA) provides you is different from mine. Your need to access memory likely is different from mine. Your applications may have a different locality of the data going to and coming from memory. Locality has a huge impact on performance as caching algorithms make use of locality to hide the inherent latency of DRAM over SRAM. The locality of data and instructions (or the lack thereof) may have a bigger impact on performance than the processor core.

    Let's say I run 1 MB data sets, and yours are 100 MB. Yours will exceed the cache size of any CPU and will work only if there is locality in your data. If not, your caching will not work effectively since incoming data will overwrite existing data only to be reloaded again. Your performance will be at the performance level of the DRAM. Even if your and my setup and software are identical, I will see much better performance out of the same HW.
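    A quick way to see this effect for yourself is to time the same operation on a working set that fits into the last-level cache versus one that does not. A rough NumPy sketch, with the caveat that the exact numbers depend entirely on the machine:

    ```python
    import time
    import numpy as np

    def time_per_element(n_bytes: int, repeats: int = 50) -> float:
        """Average time per element for repeatedly summing an array of n_bytes."""
        data = np.ones(n_bytes // 8, dtype=np.float64)
        start = time.perf_counter()
        for _ in range(repeats):
            data.sum()
        return (time.perf_counter() - start) / (repeats * data.size)

    small = time_per_element(1 * 1024 * 1024)    # ~1 MB: fits in most last-level caches
    large = time_per_element(100 * 1024 * 1024)  # ~100 MB: has to stream from DRAM
    print(f"1 MB working set:   {small * 1e9:.2f} ns per element")
    print(f"100 MB working set: {large * 1e9:.2f} ns per element")
    # On typical hardware the large working set is noticeably slower per element even
    # though the code is identical - the difference is purely the memory hierarchy.
    ```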

    For me, it does not matter if integer performance for CPU X is better than for CPU Y, and I also use more tools that parallelize compute. So a processor with more lower-performance cores but better floating-point performance is better for me, particularly if performance scales with the number of cores. The same is true for Inter-Processor Communication (IPC). If my applications and my compiler allow for better use of multiple cores and even multiple processors because my OS distributes the load better than yours, while yours keeps things on very few cores and only uses inter-process communication between threads and tasks, you will see vastly different performance based on the differences in how the processors and cores deal with IPC.

    Your applications may also differ in how much and how often they need to execute I/O functions.

    If my computer needs to execute all I/O in software on the main CPU cores, and yours can offload those functions, yours will outperform mine by a wide margin on all applications that have or create a high I/O load.

    How can a single benchmark capture this? It cannot.

    That is why there are different benchmarks for different things, and that is why it is so hard to come up with a single fair benchmark.

    This is today compounded by the problem that we do not only measure absolute performance, but also the performance per Watt, or the computational efficiency. To sum it up, there is no single fair and correct benchmark in existence today.

    Apple's approach to showing comparisons for typical applications used by MacBook owners is not more or less valid than others; it is one of many that all may render different outcomes. It is not any more or less fair than others either.

    Bitrot revisited

    Posted on by: Axel Kloth

    Yes, you read right. Bitrot. Bits don't rot, you say, and you are right. They don't. The physical media they are on deteriorate. But that is not what I mean here.

    Bits represent data, and that data is used to represent something, usually some sort of content in some format for some software. If the data is still there but it cannot reasonably be interpreted any more, it might as well have rotted away - and that is what I mean. Case in point: I needed to look up some data from my Diploma thesis, written in 1989 and luckily available in four formats: Microsoft Word from way back then, LaTeX, a printout, and a GIF scan of that printout.

    The printout on acid-free paper was still perfectly fine since I had hidden it and it was never exposed to sunlight. No oily fingers leaving prints were allowed to touch it, and so the printout was fine. The scan from way back then as a GIF was importable and readable too, but the prevalent resolution then was fairly low, so the clarity left much to be desired. It was so bad in fact that I considered it rotten. Today's Microsoft Word did not even touch the old Word file - it simply could not open or import it, even with all of the file format conversion tools that MS has available. It did not open it, it did not display it, and it did not print it. I considered that rotten too, although all the bits were still perfectly preserved. By chance, I opened it in OpenOffice, and OpenOffice opened it, displayed it, and was able to print it, albeit with the loss of pretty much all formatting, and the embedded graphs were lost. All of them. So I still considered the Word format a total loss - complete bitrot. LaTeX fared a lot better. All text, formatting and formulas were still there. Most of the embedded graphs could still be displayed, with the exception of a few plots created by an IEEE488-attached plotter (yes, back then that was modern...). I considered that the winner for a while, only to find out that if I scanned the printout and then ran some decent OCR on it, I was able to recover close to 100% of all content: text, graphs, formulas and nearly everything else.

    So what's the moral of the story? Do not believe in the promise of a major piece of software creating something that can still be interpreted 30 years down the road. It will create something that is susceptible to major bitrot. If it is important, paper and printouts still rule. Even if you can preserve the bits and the file itself, it is going to be borderline useless if the application that was used to create the file is not available any more (or does not run on modern hardware). In other words, if you create documents that must survive for more than a century, and if they are important, print them. Maybe PDF/A might be an alternative, but I am not sure about that...

    Why do I bring it up now? Very simply because I see that there is a major change in Operating Systems, applications, and of course the type of devices that we use to read, write, browse and process data. Applications and flat-hierarchy, proprietary formatted, ASCII-type data formats from the old single-user, single-tasking Operating System days on clunky PCs with a 640 * 480 screen resolution have long given way to rich data formats based on or resembling HTML, on multi-user, multi-tasking Operating Systems with at least quad VGA screen resolution even on smart phones.

    Apple's M1

    Posted on by: Axel Kloth

    Apple announced its new Macbook Air, the Mac mini and the Macbook Pro yesterday. All three are powered by a big.little combo of ARM processor cores, a set of accelerators, and very high bandwidth memory interfaces. This is Apple's first internal design of a laptop processor, and it is way beyond the smart phone processors of the A line. The new processor dubbed M1 is seriously impressive, both in absolute performance and in performance per Watt. I had not expected such a performance leap over the incumbent Intel processors. While Apple did not quite disclose which benchmarks they used, the notion of a 3.5 times performance gain is impressive. Even more so if in fact the power consumption is reduced by 50% compared to the prior Intel processor.

    ISA versus system architecture

    Posted on by: Axel Kloth

    Over the past 40 years we have seen processor families and ISAs (Instruction Set Architectures) come and go. Not always did the "better" solution win.

    That was fairly apparent with Intel's original 8086 versus the Motorola 68000 family, then again with the 80286 against the 68020 and 68030, and it happened with the SPARC and MIPS versus the DEC Alpha. We saw the RISC versus CISC wars, and luckily all of those wars have died off.

    It is possible to design an excellent CISC processor the same way it is possible to create a subpar RISC design - and vice versa. The instruction sets do not really matter. All that counts is how many instructions per second the final product executes, how much energy it uses to do so, and what silicon area it consumes. Another important metric is how much the processor design burdens the software and compiler developers, as Intel found out with its Itanium and EPIC (Explicitly Parallel Instruction Computing). Most processor architectures have by now faded away, and we are left with x86-64, ARM in its various guises, and of course RISC-V. SPARC and POWER have seen a very dramatic decline in use, and while open source variants exist, they don't seem to gain much traction. MIPS has been undead for a long time now, but its new Chinese parents may want to (or have to) revive it.

    As a result, I believe that the ISA and CPU architecture wars are over and behind us, because very simply, neither the ISA nor the CPU (core) architecture matter. What matters is how easy it is to integrate accelerators and coprocessors, as these units will do the majority of the work. The CPU will become just an orchestrator. That is why I have high hopes for RISC-V (although even that ISA has issues that drive me up the walls).

    A good portion of the tasks in a computer can and should be dealt with by offload engines or accelerators. Intel claims that its processors are fast enough to take over all computational tasks in a computer, and that no accelerators or offload engines are needed. In all fairness, Intel tried twice to decentralize compute. The first attempt was the 8086/8087/8089 combination, where the 8088 or 8086 was the main CPU, the 8087 was the math coprocessor, and up to two 8089s could be included for offloading I/O tasks. The problem was that there was no operating system making use of this, and the idea died. The second attempt started better with I2O, and this time the I/O offload engine of choice for each I/O-intensive peripheral was the i960 and later on the (ex-DEC) StrongARM. Intel made an effort to add software and drivers, but that effort ultimately also failed. For ARM, the story was the opposite. The ARM core simply could not cope with any I/O, and it required offload, so even smaller and cheaper ARM cores were integrated into everything and anything that needed a performance boost due to demanding I/O, a complex protocol, or offload for a variety of other reasons. These became the "embedded" processor cores, and they used a variety of different Real Time Operating Systems (RTOSes). That is how ARM-based laptops today came to compete performance-wise with x86-64 based laptops.

    Apple and ARM

    Posted on by: Axel Kloth

    Over the past decades Apple has taken over most of its suppliers and most verticals providing services to it. It is hard to imagine that Apple will not do that for the most important piece of IP powering their products, the CPU core.

    Apple’s legal battle with Qualcomm, which Qualcomm won for reasons incomprehensible to industry insiders, is a showcase for Apple’s need to be in control of all of its supply chain. Qualcomm not only sold its wireless WAN chips to Apple, it also required Apple to pay royalties for that exact same IP included in those ASICs. For non-technology readers, an equivalent would be if someone bought a house and then had to pay usage fees to the prior owner of the house. That is hard to comprehend.

    Apple first stood up to Qualcomm by suing Qualcomm for essentially double-dipping. Apple won in the first instance, and many in the industry believed that Apple won rightfully. However, Qualcomm appealed, and on appeal it won. Most technologists and IP lawyers would probably agree that the US appeals court made a grave mistake here. In any case, Apple decided to double down, bought the former Infineon wireless WAN group from Intel and redoubled its efforts to build a 5G modem and bring this technology in-house. For the time being, though, Apple had to revert to Qualcomm and use its technology and ASICs for its iPhones. At a market capitalization of $2T I doubt that Apple will continue to use Qualcomm chips in the future.

    This illustrates to what lengths Apple will go to take control over its suppliers. That is why I am so surprised and confused that Apple did not buy ARM. With nVidia owning ARM, this story may repeat itself. Apple could have easily bought ARM. Certainly Google and Samsung would then have moved to an alternative, as would most other smart phone suppliers, but the transition to another CPU architecture would have been upon them, and not upon Apple. I do not understand this. I had bet that Apple - prior to the sale of ARM - would transition over to RISC-V, but I lost that bet. Apple did not consolidate ARM and x86-64 to RISC-V; it stuck with ARM. Now what does Apple do?

    nVidia buying ARM

    Posted on by: Axel Kloth

    Here is my perspective (and of course I may be wrong; I have been wrong on this in the past). nVidia acquiring ARM has direct, short-term as well as long-term implications for the entire semiconductor industry and for all related industries. I am going to highlight them and point out why I think that way.

    After full integration of ARM into nVidia there are going to be issues with the licensing model.

    First, in my opinion ARM is overvalued today if one applies usual economic guidelines. ARM's licensing revenue does not support a $34B or $40B valuation. I'd peg ARM's valuation at best at around $15B (seven times annual revenue), considering annual revenues of under $2B.

    nVidia therefore did not buy ARM for future revenue attained by licensing out ARM architecture cores. Paying $40B for an asset that is worth at best $15B if future revenue is taken into account is stupid. Jensen Huang is not stupid. He is brilliant. As a result, we can assume that Huang bought ARM not because of future licensing revenue.

    He bought it because he will have an asset that can be used to put pressure on Qualcomm, Samsung and - most importantly - Apple. Huang saw that Apple caved in the licensing and WWAN radio (both 4G and 5G) spat with Qualcomm. He'll do the same to Apple.

    Apple's entire iPhone and iPad product line is predicated on ARM. The announcement was just made that Apple's laptops will follow, and one could reasonably expect that Apple will discontinue all desktop and server products. In other words, Apple is an ARM house, and one with a very large in-house ASIC and processor development department. Apple likes to have all of its suppliers under very tight control, or to absorb them. Current Apple products contain many ARM cores, probably dozens, in a wide variety of functions inside all of their chips. Consolidating all hardware on ARM and all software onto one RTOS and one OS - presumably iOS at this point in time - will help Apple save money.

    I had predicted Apple would go the RISC-V route. I was wrong. Apple went ARM. Now, Apple really does not have a choice any more: either transition again, this time to RISC-V, or be under the control of Huang's nVidia.

    Many of you will disagree, and I have heard the arguments: nVidia will continue to play nice and license out ARM cores. That might be the case for a while, to lure everyone in. However, does anyone think that Apple will now buy pre-packaged ARM cores designed by nVidia? What will Samsung do? Or Qualcomm? They won't, because pre-packaged cores leave no room for distinguishing features. An Apple or a Samsung phone that is merely on par with an HTC phone, an arbitrary phone from a Chinese supplier, or any other phone? That is not going to happen.

    Second, most users do not care about absolute performance, and even if they do, they see it as a trade-off against power consumption. While today's modern in-order CPUs all achieve roughly the same number of instructions per second, the picture changes once instructions per second are normalized against power (or once out-of-order processing or simultaneous multi-threading enter the picture). On top of that, accelerators have proven to be useful and more efficient than CPUs, and currently nVidia's proprietary CUDA GPGPUs are the weapon of choice for AI and HPC acceleration.
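
    As a quick illustration, consider the following sketch with purely made-up, hypothetical numbers (they do not describe any real product): two cores that retire a similar number of instructions per second look very different once the figure is normalized against power draw.

        /* Hypothetical numbers only - a minimal sketch of perf-per-Watt normalization. */
        #include <stdio.h>

        int main(void) {
            double ips_a = 20e9, watts_a = 5.0;   /* assumed in-order mobile-class core      */
            double ips_b = 24e9, watts_b = 45.0;  /* assumed out-of-order desktop-class core */

            printf("Core A: %.0f GIPS, %.2f GIPS/W\n", ips_a / 1e9, ips_a / (1e9 * watts_a));
            printf("Core B: %.0f GIPS, %.2f GIPS/W\n", ips_b / 1e9, ips_b / (1e9 * watts_b));
            return 0;
        }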

    If nVidia fully enters the automotive L4-and-up fray, with the ARM cores working in conjunction with nVidia GPGPUs for display and AI acceleration, it will dominate that industry very quickly. nVidia can now tackle automotive, the phone industry, and the entire HPC market all by itself, with ARM and GPGPUs as well as x86-64 working with the GPGPUs. It does not need partners or other suppliers. In other words, ARM can be everywhere there is a need for compute. ARM covers the entire spectrum from feature phones to smart phones, from tablets to laptops, and it is present in servers as well as in HPC and of course in automotive and in all embedded systems. The only exception would be desktops, but those are going away and provide a woefully low margin. Hard to believe, but x86-64 might soon be the cheap solution, and not the default (performance) choice.

    All of nVidia's customers will become its competitors. And in this case, there is no friendly "coopetition".

    I am not sure why Tim Cook did not see this, and why he did not buy ARM. His exit paths out of the conundrum now are worse. Switching to RISC-V immediately risks a lot of business and creates a trust issue for Apple. Staying with ARM exposes Apple to risk from nVidia.

    nVidia as well as Google and Western Digital were the largest RISC-V proponents. Nothing at nVidia happens without Jensen Huang's approval. So Huang knew well that RISC-V is a threat to ARM. Thus, I can only surmise that nVidia bought ARM for exactly one reason.

    However, this has other implications as well. This brings me to the third argument, and that is the number of processor core choices available. What most casual users fail to understand is that the supporting ecosystem of a processor core is directly related to the number of users. The more users there are, the larger and more robust the ecosystem is, and the larger and more robust the ecosystem is, the easier it is to design and deploy products around that core.

    If ARM cannot be used by Chinese or other players because ARM now falls under CFIUS jurisdiction, alternatives must be used. There are not many:

  • POWER and PowerPC have been and still are on a steep decline, and no one knows what is going to happen after IBM's spinoff of its infrastructure play
  • MIPS has gone to China and thus can be used there, but its ecosystem is dwindling
  • Sun's (or now Oracle's) SPARC simply does not measure up, and the versions that could be used in servers are not open source
  • RISC-V

    Out of all of those, only RISC-V has a growing ecosystem, has no export restrictions, has the performance needed to compete with ARM and x86-64 in every aspect, and is open source (well, the Instruction Set Architecture is, not any given implementation).

    In summary, I believe that the real winner out of all of this will be nVidia as they can make money short-term from existing ARM licensees, but long-term, RISC-V will win as x86-64 is relegated to desktops without a margin, and to legacy server backends.

    US DoJ on Encryption – again

    Posted on by: Axel Kloth

    Yet again the US Department of Justice (DoJ) tries to pitch End-To-End Encryption against Public Safety. The reality is that the opposite is true: there is no Public Safety without End-To-End Encryption. Predictably, the DoJ brings up the exploitation of children to justify restricting the use of encryption. Encryption relies on secret keys or key pairs; the algorithms themselves are standardized. For backdoors to work, a repository of keys and key pairs has to be created. This database would be the most-targeted piece of property ever, as it would reveal all keys from everyone to everyone else using encrypted communication. Whether this database is a collection of databases held by each provider or a centrally and federally managed database does not make a difference. It will be breached. I do not want to go into any more detail here, and anyone who wants to dive deeper is invited to ping me. I promise to reply to email requests. I'd like to make it very clear: backdoors to encryption are not needed and are dangerous. This renewed attempt to push through legislation that restricts encryption must be stopped.
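
    To make the technical point concrete: the algorithm is public and standardized, and all of the secrecy lives in the key. The sketch below is a minimal example using OpenSSL's EVP interface with AES-256-GCM (error handling omitted for brevity); the only thing in this entire program worth stealing - or worth escrowing in a backdoor database - is the key.

        /* Minimal sketch: standardized AES-256-GCM via OpenSSL's EVP API.
         * The cipher is public; the 32-byte key is the only secret.       */
        #include <stdio.h>
        #include <string.h>
        #include <openssl/evp.h>
        #include <openssl/rand.h>

        int main(void) {
            unsigned char key[32], iv[12], tag[16], ct[128];
            const unsigned char msg[] = "End-to-end encrypted message";
            int len = 0, ct_len = 0;

            RAND_bytes(key, sizeof key);   /* the only secret in the whole scheme */
            RAND_bytes(iv, sizeof iv);

            EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
            EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv);
            EVP_EncryptUpdate(ctx, ct, &len, msg, (int)strlen((const char *)msg));
            ct_len = len;
            EVP_EncryptFinal_ex(ctx, ct + len, &len);
            ct_len += len;
            EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag);
            EVP_CIPHER_CTX_free(ctx);

            printf("ciphertext bytes: %d\n", ct_len);
            return 0;
        }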

    AMD buying Xilinx

    Posted on by: Axel Kloth

    It looks as if AMD is making a bid for Xilinx. This might be copying what Intel did with Altera, and that outcome was not particularly good. It might also be based on better insight at AMD, or a different vision of AMD's corporate leaders for its future. I am still scratching my head, as I cannot figure out why AMD would want to buy Xilinx. Xilinx is a great company, but I fail to see the synergies. AMD is a great company in its own right as well, but other than investing the valuation the stock market has given AMD over the past 12 to 18 months, I do not see where Xilinx would augment AMD.

    Let's check the individual strengths of each company. AMD has over the past few years caught up to Intel with regards to the x86-64 processor architecture, and arguably passed it. This is not only because Intel is behind in semiconductor manufacturing technologies versus TSMC, which makes AMD's processors. The more important aspect is that AMD's architecture appears to be superior to Intel's. That's quite a feat. Its other product line, GPUs, is lagging behind nVidia's. AMD can't seem to catch nVidia in total performance, nor in performance per Watt. nVidia has also monopolized GPGPU-accelerated compute by using a captive and proprietary standard, CUDA. AMD (ex ATI) does not have a CUDA competitor, nor is any meaningful open-source alternative (such as OpenCL) well supported by AMD. nVidia uses its advantage to the maximum possible extent, and it is now venturing into the automotive market and into AI and ML (Machine Learning) with its CUDA-supported products. AMD/ATI has nothing to counter that. Adding Xilinx to the mix does not change the equation one iota. Xilinx has excellent FPGAs, and its software to program the FPGAs (ISE and Vivado) is ahead of Intel/Altera's offerings, but it does not compare to CUDA. Xilinx FPGAs also don't offer a direct socket-level interface to any AMD server processor family that would increase interprocessor bandwidth and reduce latency between the CPU and the FPGA accelerator. Both would be needed to guarantee a performance advantage over PCIe-attached GPGPUs.

    nVidia, on the other hand, is buying ARM. Whether that means that nVidia is going to drop RISC-V remains to be seen, but if one assumes that accelerated compute is the future, then nVidia is much better positioned, as GPGPU compute is synonymous with nVidia and CUDA. In other words, if solutions three years from now are made up of 80% GPGPU, 10% CPU and another 10% of miscellaneous other logic on the die, then nVidia will take the lead and kill off x86-64, independent of whether AMD buys Xilinx or not, and of how well the integration goes. The logic behind the AMD/Xilinx tie-up escapes me. If you have an idea why, send me an email at info at abacus-semi.com.

    Oracle kills off SPARC and Solaris

    Posted on by: Axel Kloth

    Oracle bought Sun Microsystems a long time ago. Considering Oracle and its past behavior, I am amazed that it did not kill off Solaris and SPARC (a RISC processor architecture) right after the acquisition. SPARC is an acronym; it stands for Scalable Processor ARChitecture. The problem is that it is not scalable and never was. Anyone trying to scale out a large number of SPARC cores will have run into that problem. Niagara could not fix that. As a result, I am not surprised that SPARC finally hit the end of the road, and if my contacts at Fujitsu are correct, they have given up as well.

    So I think it is fair to say goodbye to SPARC. Rest in peace, and please don't come back.

    Solaris, on the other hand, was a distinguishing feature. It was not better than other UNIX variants, but it was proprietary. I had assumed that Larry Ellison preferred proprietary and captive products as they prevent a customer from migrating to better solutions. Well, I was wrong, and Oracle of late has embraced Linux. I am not sure if that was due to the lackluster performance Oracle got out of scaled-out SPARC-based data centers running Solaris, or if Oracle simply got sick of supporting Solaris when in reality most Linux installations performed at least as well. Since Solaris did not have any OS extensions that would have made it a natural choice for any of the other Oracle offerings, it simply became a burden instead of an asset. Solaris, may you rest in peace, and please don't come back either.

    Polling versus Interrupts

    Posted on by: Axel Kloth

    During the design of every embedded processor, and even of quite a few accelerators, an age-old debate is revived. It is about real-time operating systems (RTOSes) versus multi-user, multi-tasking systems, and it is about polling versus interrupts. Those concepts may sound strange and abstract, but they are actually very simple to understand. Let me use a house with doors and windows as an analogy, and let us compare the case in which doorbells exist against the case in which doorbells have not yet been invented.

    Let's start with a system that uses polling. The analogy would be me walking around the house and "polling" (i.e. checking the status of) every door and every window of the house, at preset times. If I poll the doors, I need to walk around in a circle that covers my desk and all doors. That is a piece of linear software that loops back to where I started. I could therefore write it as a linear piece of code or as a loop in a "for all doors do the following:" scheme - the compiled code will be the same or nearly identical. It does not matter how I do it; it is fairly low effort to write that software. The problem is that I cannot prioritize doors, and while I am in the loop executing the check, someone may arrive at a door I have just visited, and I will quite likely miss that visitor. Another danger is that someone chats me up with seemingly important stuff, and I miss my schedule (or the entire cycle).
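
    To make the analogy concrete, here is a minimal sketch in C of what such a polling loop typically looks like; the door names and the memory-mapped status registers are hypothetical.

        /* Minimal polling sketch: check every "door" at a fixed cadence, in a fixed order. */
        #include <stdint.h>

        #define NUM_DOORS 4

        extern volatile uint32_t door_status[NUM_DOORS];  /* hypothetical status registers */

        static void handle_visitor(int door) { /* deal with whoever is at 'door' */ }
        static void do_desk_work(void)       { /* the task I actually want to do */ }

        void main_loop(void) {
            for (;;) {
                do_desk_work();                       /* if this runs long, I miss my schedule */
                for (int d = 0; d < NUM_DOORS; d++) { /* no priorities - strict round-robin    */
                    if (door_status[d] != 0) {
                        handle_visitor(d);
                        door_status[d] = 0;           /* mark the door as dealt with           */
                    }
                }                                     /* a visitor arriving at a door just after
                                                         I passed it waits a full cycle        */
            }
        }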

    In an interrupt-driven system I can prioritize doors (for example, the doorbell for emergency personnel has a different pitch and cadence than the one for the mailman bringing invoices and checks, and that again differs from the one for the least favorite salesman from the least favorite company we buy products and services from). That way I can deal with visitors at multiple doors easily. Let's say I am sitting at my desk and the invoice-delivering mailman rings the bell. I need to write down what I am doing so I can resume that task once I am back. I then proceed to the mailman to greet him and fetch my invoices for the day. On the way to the door, the emergency doorbell rings, and so I take note that I was on the way to the invoice-delivery door, proceed to the emergency door, and let the emergency personnel in while disabling all lower-priority interrupts, or have them queue up in the interrupt request queues. If my presence is required, I stay with the emergency personnel for as long as I am needed, and if I am not needed, the emergency door alarm is marked as being dealt with, the lower-priority interrupt requests are re-enabled, and I can resume my prior task once all other pending doorbells are dealt with. Since the invoice-delivery doorbell handling task is still pending completion, I proceed to the mailman, fetch my invoices, mark that as finished, and resume my desk task. An additional benefit arises when I can mask the interrupts I do not want to deal with at a given point in time. If I am busy finishing some important work, I can either disable certain doorbells, or I can let their requests become pending but not allow them to interrupt me. In other words, the annoying salesman from company C will ring the doorbell and hear that it rang, but since I have some important stuff to do for the next 30 minutes, that request will stay pending. He can choose to stay for the 30 minutes, or post a note and leave.

    By using an interrupt-driven system, I am in control of when I can and want to be interrupted and which interrupt requests I take, and I can guarantee that high-priority requests are always dealt with first, even if multiple interrupt requests arrive at the same time or in a time-staggered fashion. Even if more interrupt requests arrive at once than I have time available to process, the system does not go into overload: I simply mask all lower-priority interrupt sources and either leave their requests pending or disable them entirely. Hence, it is impossible to allocate more of my time to checking the doors than I actually have available.
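
    As a counterpart to the polling sketch above, here is what the interrupt-driven version of the same house might look like. The interrupt-controller calls (irq_mask_below, irq_unmask_all) are hypothetical stand-ins for whatever APIC or interrupt controller the target platform actually provides.

        /* Minimal interrupt-driven sketch: each doorbell has its own handler and priority. */
        #include <stdint.h>

        enum { PRIO_EMERGENCY = 0, PRIO_MAIL = 1, PRIO_SALESMAN = 2 };  /* 0 = highest */

        extern void irq_mask_below(int prio);  /* hypothetical: mask all lower priorities */
        extern void irq_unmask_all(void);      /* hypothetical: re-enable masked sources  */

        volatile uint32_t pending_mail = 0;

        void emergency_isr(void) {            /* the ISR prologue is the "note on the desk" */
            irq_mask_below(PRIO_EMERGENCY);   /* nothing may interrupt this                 */
            /* ... let the emergency personnel in ... */
            irq_unmask_all();                 /* lower-priority requests become live again  */
        }

        void mail_isr(void) {
            pending_mail++;                   /* note the request; fetch the invoices later */
        }

        void salesman_isr(void) {
            /* lowest priority: stays pending while I finish my 30 minutes of real work */
        }

        void desk_task(void) {
            for (;;) {
                /* important desk work runs here and is only interrupted when I allow it */
                if (pending_mail) { /* ... fetch invoices ... */ pending_mail--; }
            }
        }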

    Let's evaluate what the overhead is: I need what is called an advanced programmable interrupt controller (APIC), and upon each incoming interrupt request I need to write down what I was doing when I was interrupted; once the request is dealt with, I resume that task and shred the note I took to remind myself of what I had been doing. That is all. The APIC is a fairly trivial piece of hardware these days, and the overhead of noting what I was in the process of doing is no different from a branch to and return from any other subroutine, other than that it is privileged. In computer speak, any CPU core can do it. In fact, all of it can be done by an I/O coprocessor within a heterogeneous MP system.

    Performance and I/O Bandwidth revisited

    Posted on by: Axel Kloth

    I would like to come back to an observation which stated that for a processor to achieve 1 FLOPS of floating-point processing performance, it would need to provide 1 byte/s of memory bandwidth. While Gene Amdahl disclaimed this particular observation, I think it is important to point out that there is a logical relationship between the performance of a processor and its I/O bandwidth - this would be Kloth's Second Observation if unclaimed or disclaimed by Amdahl or his collaborators. One FLOPS is defined as 1 floating-point operation per second. Today, we assume that all floating-point operations are on 32-bit floats (minimum requirement) or 64-bit floats (more mainstream today). Most floating-point operations require two operands and produce one result, and the instruction is issued by one instruction word of 16-bit, 32-bit or 64-bit length. Most superscalar processors will need 64-bit instruction words to issue all instructions that are supposed to be executed in the following cycle. In our example, we will see two 64-bit operands being transferred in, one instruction word (64 bit) being transferred in, and one 64-bit result being transferred out. If any instruction is supposed to take one cycle (in superscalar architectures, more than one instruction per cycle is normal), then any FPU will create one result per instruction and thus per cycle. In this example, we will need three 64-bit words in and one 64-bit word out per cycle. That is four quad words per cycle, or 32 bytes of I/O per cycle. That's 32 bytes of I/O per FLOP, or, if we divide all units by the time unit of 1 s, 32 bytes/s of I/O per FLOPS. This situation is aggravated in any processor that contains multiple superscalar cores but provides a very low number of DRAM interfaces. Only in situations in which compound and very complex floating-point instructions are executed would this not apply; I can think of sincos and tan, the hyperbolic functions, and the log and exp functions. What implications does that have? You won't see 3 TFLOPS of floating-point performance in a Xeon with a mere 200 GB/s of memory bandwidth. Our numbers are not theoretical peak performance numbers - they are numbers that indicate sustainable performance.
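
    The back-of-the-envelope arithmetic can be written down in a few lines. The sketch below assumes, as the argument above does, that every operand, instruction word and result actually crosses the memory interface, with no reuse out of registers or caches - which is exactly the worst case the observation describes.

        /* Bytes-per-FLOP arithmetic: 2 operands in + 1 instruction word in + 1 result out. */
        #include <stdio.h>

        int main(void) {
            const double bytes_per_flop = (2 + 1 + 1) * 8;  /* four 64-bit words = 32 bytes */
            const double target_flops   = 3e12;             /* 3 TFLOPS, as in the example  */
            const double available_bw   = 200e9;            /* 200 GB/s memory bandwidth    */

            printf("required bandwidth:   %.0f TB/s\n", target_flops * bytes_per_flop / 1e12);
            printf("available bandwidth:  %.1f TB/s\n", available_bw / 1e12);
            printf("sustainable this way: %.2f GFLOPS\n", available_bw / bytes_per_flop / 1e9);
            return 0;
        }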

    Most software will end up as a piece of hardware

    Posted on by: Axel Kloth

    First, I am glad the RISC-versus-CISC wars are over. Second, I am equally glad that the best of both worlds has come to dominate processor design. We now have processors that support large instruction sets (and, not surprisingly, the largest instruction set can be found in a RISC processor), and some of those instructions point to hardware engines that execute certain functions as native and possibly atomic instructions. Those functions are complex mathematical operations that were unthinkable to implement in hardware even just a decade ago. FFTs, matrix multiplications and most video codecs are commonplace functions in processors and DSPs today. Just a few years ago they were macros or function calls comprised of hundreds of simple native processor instructions. So my lemma and theory is simply that every software library function that has proven to be useful will end up as a piece of hardware. In essence, it also means that the ISA will become less important. Our processor cores are based on the open Instruction Set Architecture (ISA) RISC-V. Our RISC-V cores are modified in a lot of ways from the original UC Berkeley and SiFive RISC-V designs. While they are fully compliant with the RISC-V ISA, we focused on simplicity, scalability and performance instead of ease of design. We have also modified the "uncore" parts of the processors while retaining full compatibility so we don't have to rewrite or modify drivers. Unlike other approaches, we decided not to extend the ISA either. However, we could have designed our processors with a different core and a different ISA, and I believe it would have had no meaningful performance impact.

    As a result, users and programmers can rely on industry-standard compilers, SDKs and APIs and do not have to worry about instruction sets.

    Hardware versus software

    Posted on by: Axel Kloth

    One of my favorite pieces of advice from Albert Einstein is to "make everything as simple as possible, but not any simpler than that". Software is not simple. Software is great for testing out new algorithms, or new ideas that cannot yet be expressed algorithmically. However, once performance, robustness and security metrics and power envelopes come into the game, software is at a disadvantage against hardware. The world's programmers write more and more software; there cannot be any doubt about it. However, that software runs on more differentiated and dedicated hardware than ever. We not only have CPUs and GPUs (rebranded to GPGPUs), FPGAs, TPUs and security processors, there are also more and more offload engines for network traffic. In other words, functions that we used to execute in software are now available as hardware building blocks, invoked by simple driver commands.

    Therefore my lemma is: "Any problem that can be described algorithmically, through table lookups or a combination thereof will be implemented as a piece of software unless it can be executed more efficiently in dedicated hardware". That statement also leads directly to Kloth's First Observation. If more and more software is being written, then (atomic) instructions will have to cover more and more of the tasks executed, to save power and energy, reduce latency, and improve performance. MMX and SSE as well as H.264 video codecs are only one set of examples. Kloth's First Observation can be summarized as follows: "Any sequence of instructions that is proven to be deployed to a degree greater than a preset threshold in a typical application mix is replaced by dedicated hardware, controlled by a Finite State Machine dedicated to this sequence of instructions, and invoked by a new instruction word. The threshold is determined by goals for performance and latency as well as by the need to reduce power and energy consumption".

    If George Haber's original Observation ("If it can be done in software it will") were true, then we'd be back to the original Turing Machine, where no hardware assist existed for anything. That would mean that our processors execute in, out, inc and bne (Branch on Not Equal). That is all that is required, but of course it is not practical. While we could build a processor like this today in very few gates, and with today's SiGe or GaAs process technologies probably in the range of 60 - 100 GHz, it would not be practical, it would not perform well, and its power consumption per (compound) instruction executed would be well above the threshold that anyone would accept today. Instead, we keep adding instructions and dedicated hardware to the instruction set of a processor, and sometimes we even need an entirely new class of coprocessors for specific tasks, because the memory bandwidth of a CPU might not be compatible with the needs of that coprocessor, and because there may be real-time requirements that, for example, a DSP can fulfill and a CPU cannot. So while the breadth of software is increasing, so is the world of hardware accelerators.
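
    A small and well-known example of this observation at work is the fused multiply-add: a multiply followed by an add occurs often enough in real workloads that it became a single instruction backed by dedicated hardware, exposed in C through the standard fma() library call. The sketch below contrasts the two-instruction sequence with the fused instruction; compile with -ffp-contract=off (gcc/clang) so the compiler does not fuse the first version on its own.

        /* The two-operation version rounds the intermediate product; the fused
         * instruction rounds only once and recovers the tiny true result.     */
        #include <math.h>
        #include <stdio.h>

        int main(void) {
            double a = 1.0 + 0x1p-27;           /* exactly representable         */
            double b = 1.0 - 0x1p-27;

            double two_ops = a * b - 1.0;       /* a*b rounds to 1.0, result 0.0 */
            double fused   = fma(a, b, -1.0);   /* single rounding: -2^-54       */

            printf("a*b - 1.0     = %.17g\n", two_ops);
            printf("fma(a, b, -1) = %.17g\n", fused);
            return 0;
        }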

    Recent events at Abacus Semiconductor

    Posted on by: Axel Kloth

    I am happy to announce that Abacus Semiconductor Corporation has been formally established. We have identified team members we'd like to bring on board, and we will finish that process in the next few weeks. We are going to build out the team to the degree necessary and then announce it by the end of June of 2021.
