RSS | technovelty home | page of ian | ian@wienand.org
Today I started wondering what actually happens when you plug in a USB device. The little tour below goes from starting the USB subsystem to plugging something in, and I hope it is reasonably accurate. This entry is probably best read with a copy of the Linux kernel handy.
We can start our tour right at the very bottom in the heart of the USB core.
Things really start at the USB initialisation function in drivers/usb/core/usb.c:usb_init(). The first interesting call is to drivers/base/bus.c:bus_register(). We see that it passes as struct bus_type which looks like:
struct bus_type usb_bus_type = {
.name = "usb",
.match = usb_device_match,
.uevent = usb_uevent,
.suspend = usb_suspend,
.resume = usb_resume,
};
This is registering a new type of bus with the Linux driver core framework. The bus doesn't have much yet, just a name and some helper functions, but registering a bus sets up the kobject hierarchy that gets exported through /sys/bus/ (/sys/bus/usb in this case) and will allow the further hierarchical building of devices underneath by attaching them as the system runs. This is like the root directory of the USB system.
Your desktop/laptop/palmtop etc has a host controller which directly interfaces to USB devices; common types are UHCI, OHCI and EHCI. The drivers for these various types of controllers live in drivers/usb/host. These controllers are similar but different, so to minimise code duplication Linux has a Host Controller Driver framework (drivers/usb/core/hcd.c) which abstracts most of the common operations from the host controller driver.
The HCD layer does this by keeping a struct usb_hcd (drivers/usb/core/hcd.h) with all common information in it for a host controller. Each of host controller drivers fills out a struct hc_driver for its hardware dependent operations, as per below (taken from the UHCI driver)
static const struct hc_driver uhci_driver = {
.description = hcd_name,
.product_desc = "UHCI Host Controller",
.hcd_priv_size = sizeof(struct uhci_hcd),
/* Generic hardware linkage */
.irq = uhci_irq,
.flags = HCD_USB11,
/* Basic lifecycle operations */
.reset = uhci_init,
.start = uhci_start,
#ifdef CONFIG_PM
.suspend = uhci_suspend,
.resume = uhci_resume,
.bus_suspend = uhci_rh_suspend,
.bus_resume = uhci_rh_resume,
#endif
.stop = uhci_stop,
.urb_enqueue = uhci_urb_enqueue,
.urb_dequeue = uhci_urb_dequeue,
.endpoint_disable = uhci_hcd_endpoint_disable,
.get_frame_number = uhci_hcd_get_frame_number,
.hub_status_data = uhci_hub_status_data,
.hub_control = uhci_hub_control,
};
It might be helpful to clarify a few USB concepts now. A USB device defines a group of end-points, where are grouped together into an interface. An end-point can be either "IN" or "OUT", and sends data in one direction only. End-points can have a number of different types:
There can be many interfaces (made of multiple end-points) and interfaces are grouped into "configurations". Most devices only have a single configuration.
You can see how this works at the host controller level with the above diagram clagged from the Intel UHCI documents. The controller has a "frame" register which is incremented every millisecond. Each frame pointer points to a queue of "transfer descriptors". The driver needs to schedule this work so that 90% of the time is given to isochronous data, and 10% left for control data. Should any time remain, the bulk data is transferred. You can see that any transfer descriptor for isochronous data will not be retried, but other data sits in a queue so it is never lost.
The USB layer communicates through USB request blocks, or URBs. A URB contains information about what end-point this request relates to, data, any related information or attributes and a call-back function to be called when the URB is complete. Drivers submit URBs to the USB core, which manages them in co-ordination with the USB host (see the urb_enqueue functions provided by the host driver). Your data gets sent off to the USB device by the USB core, and when its done your call-back is triggered.
There is one more element quite fundamental to the USB core, which is the hub -- all USB devices plug into a hub. The USB controller implements a root hub; you can have multiple root hubs in a machine, and other hubs can then connect to root hubs. The hub driver lives in drivers/usb/core/hub.c. The USB initialisation function starts up the khubd thread (drivers/usb/core/hub.c:usb_hub_init()) which waits for and handles USB events, but we will return to that later.
The hub driver is the first USB driver to be setup, so by examining how that works we can get a feel for how other drivers work.
The hub driver setup starts in drivers/usb/core/usb.c:usb_init() where the drivers/usb/core/hub.c:usb_hub_init() function is called. This calls drivers/usb/core/driver.c:usb_register_driver() which adds itself to the USB bus we mentioned previously. This sets up usb_probe_device() to handle any probe events from the Linux driver core. At this point the hub driver is ready to claim anything that looks like a USB hub.
The root hub setup phase comes out of the HCD setup phase, which proceeds something like this. The Linux driver core goes through all devices on the PCI bus (including the USB host controller of course) and calls the probe() function the device's driver has registered (the host controller registered itself in its initialisation function drivers/usb/core/uhci-hcd.c:uhci_hcd_init()). The USB host controller driver wires up the HCD layer function drivers/usb/core/hcd-pci.c:usb_hcd_pci_probe() to handle this probe (see struct pci_driver uhci_pci_driver in uhci-hcd.c; usb_hcd_pci_probe() does some generic setup, but then calls back into the host driver start() function to do any device specific setup).
usb_hcd_pci_probe() ends up calling drivers/usb/core/hcd.c:usb_add_hcd() which does some generic HCD setup and ends up calling register_root_hub().
register_root_hub() creates a new USB device and registers it with drivers/usb/core/hub.c:usb_new_device(). usb_new_device() first calls drivers/usb/core/config.c:usb_get_configuration which sets up the interface (all hubs only have one interface; the interrupt interface to notify of events on the hub) for the device before registering with the Linux driver core via drivers/base/core/device_add(). device_add() then causes the USB bus to be rescanned.
Now we can examine how a USB device gets associated with a driver. To summarise, what needs to happen is the hub driver needs to bind to the host controllers root hub. This illustrates the general concept of a new device binding to a driver.
There is, believe it or not, more layers that come into play now. There is a "generic" USB driver that handles the setup of interfaces in the USB core. As we mentioned earlier a device has a series of end-points grouped together into an interface, and then may have multiple interfaces for different things. Drivers really only care about communicating with the device at the interface level, so the USB core takes care of getting things to this stage for you.
In drivers/usb/core/usb.c:usb_init() the final call is to drivers/usb/core/driver.c:usb_register_device_driver(). This does some simple wrapping of the driver, most importantly setting up usb_probe_device() to handle any probes. It then registers this with Linux driver core with a call to driver_register.
Remembering that drivers/usb/hub.c:usb_new_device() has called device_add(), the driver core will now see this new device and start probing for a driver to handle it. The USB generic driver is going to be called, which has registered drivers/usb/core/driver.c:usb_probe_device() as its probe function. This converts the Linux driver core device back to a USB device (i.e. the USB device that was registered by register_root_hub()) and calls calls the drivers probe function drivers/usb/core/generic.c:generic_probe().
The role of the generic driver is to get the interfaces on the device up and running. Firstly it calls drivers/usb/generic.c:choose_configuration() which simply reads through the device data and chooses a sane configuration. But wait, how does it know what is a sane configuration for the root hub? All the information has been "faked" for the root hub in drivers/usb/core/hcd.c:usb2_rh_dev_descriptor and the like. The root hub details are defined by the USB specification, so these can be kept statically.
Assuming everything is OK, drivers/usb/core/message.c:usb_set_configuration() is called to set up the chosen configuration. It uses the helper function usb_control_message() to send a message to the device about what configuration mode to go into. It then goes through the available interfaces setting up the kernel structures, and adds them with device_add().
Inch by inch, we are getting closer to having the USB system up and running. When a new interface is added, the driver core will now try and find a driver to handle it. When an interface driver is registered with drivers/usb/core/driver.c:usb_register_driver() it sets the probe function to drivers/usb/core/driver.c:usb_probe_interface(). So the driver core calls this function, and the first thing it does is checks the ID against the IDs the driver is happy to handle from the drivers id_table. It uses usb_match_one_id() for this, which can match the class, subclass and protocol to make sure the it will work with this driver.
The root hub is part of the hub class, so can be handled by the hub driver. Therefore drivers/usb/core/hub.c:hub_probe() will be called to get the hub driver to bind with the new hub. This does some sanity checking and calls into hub_configure(). This then does some more general setup, but things get interesting when the interrupt end-point is setup. The interrupt end-point on a hub will send an event whenever something happens on the hub, such as a device being plugged in or unplugged. This is done by creating a URB and binding it to the end-point, asking that hub_irq be called whenever this URB is complete (e.g. when an event is received).
At this point, the system is waiting for something to happen. The root hub is setup and listening for new events - we are ready to plug in our device.
When this happens the host controller will raise an interrupt signalling that one of its ports has changed state. This will be handled by drivers/usb/host/uhci-hcd.c:uhci_irq(). This checks that the interrupt wasn't due to an error and then calls drivers/usb/host/uhci-q.c:uhci_scan_schedule(). This function implements the interface between the UHCI "transfer data" messages and the Linux URB scheme. It goes through the queues as illustrated in the figure above and finds any complete URBs and calls their completion function.
You may remember that the interrupt end-point of the hub was associated with a URB that would call drivers/usb/core/hub.c:hub_irq(). The UHCI code will go through it's queues, find that this URB is complete and therfore call the completion function.
We previously mentioned that the kernel starts the khubd daemon to handle hub events. We can see that hub_irq does some simple error checking, but its real job is to notify khubd that there is a new event on this hub.
hub_events() is the khubd daemon thread doing all the work. Once notified the hub has new event, it goes and reads the hub status to see what to do. A new device appearing will trigger a port change event, which is handled by hub_port_connect_change(). This does some initial setup of the device, but most importantly now calls usb_new_device().
At this point, if the USB driver for this device is loaded, the device is essentially ready to go. As with the root hub, the device will be probed and its interfaces found with the generic driver, and then those interfaces will be registered and any driver asked if they wish to bind to them.
The question arises, however, about how modules are dynamically loaded. Keeping every single module for every possible USB device that may plug into the system is clearly sub-optimal, so we wish to load a device driver module only when required.
Most everyone is aware of udev which handles /dev device nodes these days. udev sits around listening for uevents which get sent to it via the kernel. These uevents get sent via netlink, which is like Unix sockets but different. To get uevent messages all you need to do is open a PF_NETLINK socket to the NETLINK_KOBJECT_UEVENT "port", e.g. as the code below extract from udev does.
memset(&snl, 0x00, sizeof(struct sockaddr_nl));
snl.nl_family = AF_NETLINK;
snl.nl_pid = getpid();
snl.nl_groups = 1;
uevent_netlink_sock = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT);
if (uevent_netlink_sock == -1) {
err("error getting socket: %s", strerror(errno));
return -1;
}
So we have udev sitting around up in userspace waiting for messages that come from the kernel (as a side note, if /sys/kernel/uevent_helper is filled in with a path that will be run and receive the events with environment variables set; this is useful in early boot before udev has started.
Thus we want to get a message out to udev that there is a new USB interface around, and it should try and load a module to bind to it.
We previously identified drivers/usb/core/message.c:usb_set_configuration() as reading the interfaces for a device and registering them with the driver core. When this registers the interface, it also registers a uevent helper drivers/usb/core/message.c:usb_if_uevent(). This function gets called by the Linux driver core when the driver is added into the driver hierarchy. It adds some information to the uevent:
if (add_uevent_var(envp, num_envp, &i, buffer, buffer_size, &length, "INTERFACE=%d/%d/%d", alt->desc.bInterfaceClass, alt->desc.bInterfaceSubClass, alt->desc.bInterfaceProtocol)) return -ENOMEM; if (add_uevent_var(envp, num_envp, &i, buffer, buffer_size, &length, "MODALIAS=usb:v%04Xp%04Xd%04Xdc%02Xdsc%02Xdp%02Xic%02Xisc%02Xip%02X", le16_to_cpu(usb_dev->descriptor.idVendor), le16_to_cpu(usb_dev->descriptor.idProduct), le16_to_cpu(usb_dev->descriptor.bcdDevice), usb_dev->descriptor.bDeviceClass, usb_dev->descriptor.bDeviceSubClass, usb_dev->descriptor.bDeviceProtocol, alt->desc.bInterfaceClass, alt->desc.bInterfaceSubClass, alt->desc.bInterfaceProtocol)) return -ENOMEM;
udev now has all the information required ask modprobe to load a module, if it can. The rule on my system looks something like:
# load the drivers
ENV{MODALIAS}=="?*", RUN+="/sbin/modprobe --use-blacklist $env{MODALIAS}"
Now, all going well, modprobe loads an appropriate module based on the vendor id, etc. Loading the driver causes a re-scan of the USB bus and the driver will be probed to see if wants to handle any of the unbinded USB devices (using the same procedure as was done setting up the root hub). It should take over control of the device, and that's it your USB device is up and running!
Simple!
Handy resources if you want to know more: Linux Device Drivers, USB2 Specification, UHCI Design Guide.
posted at: Fri, 20 Jul 2007 16:29 | in /code/linux | permalink | add comment (5 others)
When playing around with memory management a matrix multiply is often mentioned as a TLB thrasher, and a good way to stress your system. It's interesting to illustrate this process a little.
The first thing to remember is that C99 (6.5.2.1 to be precise) dictates that arrays are always stored in row-major order, as shown below.
So, remembering back to first year algebra, to multiply two matrices we multiply and sum rows and columns, illustrated linearly as the computer see it below. Black lines are page boundaries, while boxes are matrix elements.
The simplest code to do this is usually along the lines of
int a[LEN][LEN]; int b[LEN][LEN]; int r[LEN][LEN]; int i,j,k for(i=0; i < LEN; i++) for(j = 0; i < LEN; j++) for(k = 0; k < LEN; k++) r[i][j] += a[i][k] + b[k][j];
We can see how we have a repeated linear walk over the memory
holding matrix b, done for each element in
a. If we make b sufficiently large (say, a
few hundred pages) we can be sure that by the time we get to the end,
we have kicked out the earlier TLB entries. Then we start all over
again, and start faulting pages back in. Thus a matrix multiply
problem comes down to doing repeated page-sized stride walks over a
linear region.
Looking at the diagram, it is also clearer what can solve this problem. We can either have more TLB entries, so we don't have to kick out entries which we will use again, or have larger pages -- e.g. less black lines. Or a smarter algorithm ...
posted at: Wed, 10 Jan 2007 11:21 | in /code/linux | permalink | add comment (0 others)
Several people asked me to re-explain virtual linear page tables after my SLUG talk. So here it goes!
First, let's review the standard translation mechanism. We have a large address space, which we divide up into pages. We map pages to frames of physical memory.
The processor has a translation look-aside buffer or TLB which stores a cache of virtual page to physical frame mappings. It is small, but fast. Like any cache, it runs out and needs to be filled. The operating system does this from the page tables, which keep a virtual to physical mapping.
On Linux, we use a multi-level hierarchical page table, illustrated above. First you take some of the top bits, and use that as an index into the page global directory (PGD). You then follow that to a page middle directory entry (PMD), take more bits as an index into that page, and finally follow that to page table entries. Finally, you take another index to find the PTE entry, which points to the physical page the given virtual address is mapped to.
So far, so good. But this walking process is very slow. It requires a lot of memory references and hence cache traffic, and has to be performed at regular intervals.
What would be good is if we dispensed with the upper levels, as per the above diagram. So we just keep the PTE entries (which point directly to allocated physical pages) in one long array, and then simply use the VPN as an index into it. This would require much less effort.
The problem is that we can not find that many contiguous (next to each other) physical pages. Even if we could, the amount of physical memory we had to allocate to page tables would be huge! So this scheme won't work.
But we have a large virtual address space. Why not dedicate the top portion of that to be a linear array of PTE entries, and use the TLB to map the virtual pages full of PTE's to actual pages full of PTE's.
So when we ask for a virtual page, and it is not mapped by the TLB, we can use the virtual page number as an index into our virtually linear page table (note above we mention that we hash the virtual page number, it equates to the same thing as just using it as an index). We can now go directly to the PTE entry, and as long as the virtual page that PTE entry exists in has a mapping to the underlying physical page full of PTE entries, there is no need to walk the tree - the "virtual" PTE is now a pointer to the real thing! From there, it is a simple matter to read the PTE entry and find the physical frame for the virtual page we are looking for.
Of course, as with any engineering effort, there are trade-offs. Firstly, we have to sacrifice some of our virtual address space to the virtual linear table. We have lots of address space, so this is not too severe. The other trade-off is that you now have an extra TLB entry; as we mentioned the TLB is small, so taking up more entries can have a penalising effect. However, if you are walking linearly through pages, as a program usually does, you will not need to do a full page table walk for a whole page full of PTE entries and it only cost you one TLB entry; a significant win.
Itanium attempts to find the hashed PTE address in hardware, and calls is the virtually hashed page table walker or VHPT walker. If it can not resolve the situation, it returns control to the operating system, which will then have to do a full walk of the page tables. There's a few more details, but that's the broad strokes.
I hope that clears things up for a few people ...
... however if it does not, please feel free to modify the figures to make them clearer!
posted at: Fri, 09 Jun 2006 17:06 | in /code/linux | permalink | add comment (0 others)
Unless you are one of the core kernel developers, you always communicate your changes via patches posted to mailing lists. However, if your patch is going to take more than ten minutes to write you need to keep up to date with the upstream kernel source, because otherwise you'll end up hopelessly behind. The problem is that you always need to keep your changes on the top, because you need to be easily able to export them as a patch against the upstream tree at any point.
I call this being on the "edge" of kernel development -- there is
nobody further out from the middle than you. Generally, I have found
git and any of the wrappers quite painful, but thanks to
Matt Chapman I gave Mercurial
(hg) a go, and it is perfect.
Firstly, clone a new tree
hg clone http://www.kernel.org/hg/linux-2.6/
This tree you will never touch directly; it is a copy of where Linus is at any point in time. Next, clone a working directory where you will be doing your development.
hg clone linux-2.6 linux-2.6-working
Start doing your work on the linux-2.6-working tree;
make commits as necessary as you develop major parts of your work.
You will want to tag your first commit to make it easier to create diffs of your work. Use a local tag, because it is only for your reference within the local tree. So after your first commit do
hg tag -l kernel-import
You can then get all your changes against the kernel you have downloaded from with
hg diff kernel-import:tip
Now several hours/days/weeks later you need to update your changes against the latest upstream versions. Firstly, update the upstream tree
cd linux-2.6 hg pull
There shouldn't be any conflicts or issues, because you have not changed anything locally. Now create a new update tree, cloned from the latest upstream version (just as you did when you started).
hg clone linux-2.6 linux-2.6-update
Now pull into the update tree your development tree. This will (should) have the effect of just grabbing all your changesets.
cd linux-2.6-update hg pull ../linux-2.6-working
Now your project has multiple heads -- one from your old working tree and one from the new tree you are merging into. You will now need to merge the pulled changes into the new update tree. Do this with
hg update -m
Hopefully there won't be any conflicts, but if there are you will have to resolve them. Once done, the final step is to commit your changes into the new tree.
hg commit -m "merge to new linus tree"
Now create a local tag at the start of your changesets by looking
at hg log and using hg tag -l kernel-import
first-of-your-changesets-revision-number (is there a better way
to do this?).
hg tag kernel-import
So your changesets should now be on the the tip of the update tree. So the update tree becomes the new working tree, and you can archive the old working tree. You can export your patches (for sending to lists, etc) with
hg export kernel-import:tip
Now continue your work in the update tree you just created. Commit as much as you like doing all the development you require, at any time you can use the export from the tag you created to get a patch of your work.
Eventually you will need to re-sync with upstream again. At this point repeat the process; make a new update tree and pull your working tree into it. Archive the old working tree and continue development on the new 'update' tree.
posted at: Thu, 24 Nov 2005 15:20 | in /code/linux | permalink | add comment (0 others)
I've been hacking a lot of kernel headers lately, which gives you
plenty of time to think as the rebuild magic happens. So I thought
I'd try to come up to speed a bit more on git, the new
kernel SCM.
Which lead me to attempt a diagram explaining how things fit together. As usual with these things, I'm not sure if it makes things clearer or more confusing.
xfig source available if you want to make it better.
posted at: Wed, 16 Nov 2005 16:55 | in /code/linux | permalink | add comment (1 others)
I've had the opportunity to play with software RAID over the last few days, and one thing that confused the heck out of me was why I created my fresh RAID 5 array with a few disks, and from the start one of the disks came up missing.
Eventually I found in the mdadm man page
The mdadm code enlightens me a bit more
/* If this is raid5, we want to configure the last active slot * as missing, so that a reconstruct happens (faster than re-parity)
But I'm still wondering why. I think I understand from following
the code (if you know the code, please correct any mis-assumptions).
Firstly, md.c schedules a resync when it notices things
are out of sync, like when an array is created and the drives are not
marked in sync or when there is a missing drive and a spare.
md goes through the sectors in the disk one by one
(doing a bit of rate limiting) and calls sync_request for
the underlying RAID layer for each sector. For RAID5 this firstly
converts the sector into a stripe and sets some flags on that stripe
(STRIPE_SYNCING). Each stripe structure has a buffer for
each disk in the array, amongst other things (struct
stripe_head in include/linux/raid/raid5.h). It
then calls handle_stripe to, well, handle the stripe.
/* maybe we need to check and possibly fix the parity for this stripe
* Any reads will already have been scheduled, so we just see if enough data
* is available
*/
if (syncing && locked == 0 &&
!test_bit(STRIPE_INSYNC, &sh->state) && failed <= 1) {
set_bit(STRIPE_HANDLE, &sh->state);
if (failed == 0) {
char *pagea;
if (uptodate != disks)
BUG();
compute_parity(sh, CHECK_PARITY);
uptodate--;
pagea = page_address(sh->dev[sh->pd_idx].page);
if ((*(u32*)pagea) == 0 &&
!memcmp(pagea, pagea+4, STRIPE_SIZE-4)) {
/* parity is correct (on disc, not in buffer any more) */
set_bit(STRIPE_INSYNC, &sh->state);
}
}
if (!test_bit(STRIPE_INSYNC, &sh->state)) {
if (failed==0)
failed_num = sh->pd_idx;
/* should be able to compute the missing block and write it to spare */
if (!test_bit(R5_UPTODATE, &sh->dev[failed_num].flags)) {
if (uptodate+1 != disks)
BUG();
compute_block(sh, failed_num);
uptodate++;
}
if (uptodate != disks)
BUG();
dev = &sh->dev[failed_num];
set_bit(R5_LOCKED, &dev->flags);
set_bit(R5_Wantwrite, &dev->flags);
locked++;
set_bit(STRIPE_INSYNC, &sh->state);
set_bit(R5_Syncio, &dev->flags);
}
}
We can inspect one particular part of that code that can get
triggered if we're running a sync. If we don't have a failed
drive (e.g. failed == 0) then we call into
compute_parity to check the parity of the current stripe
sh (for stripe header). If the parity is correct, then
we flag the stripe as in sync, and carry on.
However, if the parity is incorrect, we will fall out to the next
if statement. The first if checks if we have a failed
drive; if we don't then we flag the failed drive as the disk with the
parity for this stripe (sh->pd_idx). This is where the
optimisation comes in -- if we have a failed drive then we pass it in
directly to compute_block meaning that we will always be
bringing the "failed" disk into sync with the other drives. We then
set some flags so that the lower layers know to write out that
drive.
The alternative is to update the parity, which in RAID5 is spread across the disks. This means you're constantly stuffing up nice sequential reads by making the disks seek backwards to write parity for previous stripes. Under this scheme, for a situation where parity is mostly wrong (e.g. creating a new array) you just read from the other drives and bring that failed drive into sync with the others. Thus the only disk doing the writing is the spare that is being rebuilt, and it is writing in a nice, sequential manner. And the other disks are reading in a nice sequential manner too. Ingenious huh!
I did run this by Neil Brown to make sure I wasn't completely wrong, and he pointed me to a recent email where he explains things. If anyone knows, he does!
posted at: Thu, 07 Jul 2005 16:09 | in /code/linux | permalink | add comment (1 others)
remap_file_pages modifies an existing mmap of a file
to point to different pages. When you mmap a file, the first page in
memory points to the first page of the file on disk, the second page
in memory to the second page on disk, etc.
If you don't want this, e.g. you want the first page in
memory to refer to some other page, you would usually have to multiple
mmap operations. If you do this a lot, it can slow the
kernel down keeping track of it all.
Instead, mmap the file and then use
remap_file_pages to modify that mmap to point into
different places in the file. In the example below, we create a
temporary file, map it into memory and then "turn it around" ... that
is the first page in memory is the last page on the disk, and so
on.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <string.h>
#define DEFAULT_MMAP_SIZE (1024*1024*10) //10MB
#define PASSED 0
#define FAILED 1
static inline char page_hash(int page)
{
return (char)(page % 256);
}
static inline char *page_offset_to_addr(char *start, int page)
{
return start + (getpagesize() * page);
}
int genfile(char *file, const size_t size)
{
ssize_t bytes = 0;
int fd = 0;
char *buf;
int i, page;
if (!mkstemp(file))
return FAILED;
fd = open(file, O_RDWR);
if (fd == -1)
return FAILED;
buf = malloc( getpagesize() );
printf("Writing out %d pages to %s\n", (int)(size / getpagesize()), file);
for (page = 0 ; page < (size / getpagesize()) ; page++)
{
for (i = 0 ; i < getpagesize() ; i++)
buf[i] = page_hash(page);
bytes += write(fd, buf, getpagesize());
}
close(fd);
sync();
return PASSED;
}
int compare(char *remapped_file, char *file, unsigned long size)
{
char *mmap_orig_file;
int i = 0, fd = open(file, O_RDONLY);
int err = FAILED;
if (!remapped_file || fd == -1)
return FAILED;
/* map in the file from disk, again */
if ((mmap_orig_file =
mmap(0, size, PROT_READ, MAP_SHARED, fd,
0)) == MAP_FAILED) {
goto out_mmap_fail;
}
/* walk the original backwards and compare it to the remapped
* file going forwards, page by page. they should be the
* same.
*/
int cur_remap_page = 0;
int cur_orig_page = (size / getpagesize()) - 1;
while (cur_orig_page >= 0)
{
printf("compare %05d -> %05d\r", cur_remap_page, cur_orig_page);
if ((i = memcmp(page_offset_to_addr(mmap_orig_file, cur_orig_page),
page_offset_to_addr(remapped_file, cur_remap_page),
getpagesize()) != 0)) {
err = FAILED;
goto out;
}
cur_remap_page++;
cur_orig_page--;
}
printf("\n");
err = PASSED;
out:
munmap(mmap_orig_file, size);
out_mmap_fail:
close(fd);
return err;
}
int main(void)
{
int fd;
const char *tmp_default = "/tmp/remap-testXXXXXX";
size_t size = DEFAULT_MMAP_SIZE;
char file[256], *addr;
int err = FAILED;
int sys_pagesize = getpagesize();
strcpy(file, tmp_default);
genfile(file, size);
/*
* Map in the file
*/
fd = open(file, O_RDWR);
if ((addr =
mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
0)) == MAP_FAILED)
goto out;
/* Turn the file around with remap_file_pages; that is the
* last page of the file on disk becomes the first page in
* memory, the second last page the second page in memory,
* etc.
*/
int cur_mmap_page = (size / sys_pagesize) - 1;
int cur_file_page = 0;
while (cur_mmap_page >= 0)
{
remap_file_pages(page_offset_to_addr(addr, cur_mmap_page),
sys_pagesize, 0, cur_file_page, 0);
cur_mmap_page--;
cur_file_page++;
}
err = compare(addr, file, size);
if (err == FAILED)
printf("Test Failed!\n");
else
printf("Test Passed!\n");
out:
close(fd);
unlink(file);
exit(err);
}
posted at: Mon, 02 May 2005 11:42 | in /code/linux | permalink | add comment (0 others)
If you're planning on attending the kernel tute at LCA05 but have an Apple laptop, you might be interested in my cross compiler debian packages.
I would highly recommend you don't try try this yourself :) Originally I planned to write a little note on how to do it, but after about the third hour I'd forgotten what I'd hacked to get to the point I was at. If you do try, don't bother with gcc-3.4; it has some sort of heisenbug that segfaults the assembler in various ways. And just getting C and not g++/ada/pascal/java/etc was also harder than it should be.
Anyway, with the packages installed you need to build the kernel with two extra flags.
$ make CROSS_COMPILE=i386-linux- ARCH=i386
Other than that, the built binary boots fine for me in qemu on my iBook!
update: If you want to compile userspace apps, you will need
to use tpkg-install-libc from
toolchain-source.
posted at: Sat, 16 Apr 2005 17:37 | in /code/linux | permalink | add comment (0 others)
You probably already know that a floating point number is
represented in binary as sign * significand *
radixexponent. Thus you can represent the
number 20.23 with radix 10 (i.e. base 10) as 1 * .2023 *
102 or as 1 * .02023 *
103.
Since one number can be represented in different ways, we define we
defined the normalised version as the one that satisfies
1/radix <= significand < 1. You can read that as saying
"the leftmost number in the significand should not be zero".
So when we convert into binary (base 2) rather than base 10, we are saying that the "leftmost number should not be zero", hence, it can only be one. In fact, the IEEE standard "hides" the 1 because it is implied by a normalised number, giving you an extra bit for more precision in the significand.
So to normalise a floating point number you have to shift the signifcand left a number of times, and check if the first digit is a one. This is something that the hardware can probably do very fast, since it has to do it a lot. Combine this with an architecture like IA64 which has a 64 bit significand, and you've just found a way to do a really cool implementation of "find the first bit that is not zero in a 64 bit value", a common operation when working with bitfields (it was really David Mosberger who originally came up with that idea in the kernel).
#define ia64_getf_exp(x) \
({ \
long ia64_intri_res; \
\
asm ("getf.exp %0=%1" : "=r"(ia64_intri_res) : "f"(x)); \
\
ia64_intri_res; \
})
int main(void)
{
long double d = 0x1UL;
long exp;
exp = ia64_getf_exp(d);
printf("The first non-zero bit is bit %d\n", exp - 65535);
}
Note the processor is using an 82 bit floating point implementation, with a 17 bit exponent component. Thus we use a 16 bit (0xFFFF, or 65535) bias so we can represent positive and negative numbers (i.e, zero is represented by 65535, 1 by 65536 and -1 by 65534) without an explicit sign bit.
IA64 uses the floating point registers in other interesting ways
too. For example, the clear_page() implementation in the
kernel spills zero'd floating point registers into memory because that
provides you with the maximum memory bandwidth. The libc
bzero() implementation does a similar thing.
posted at: Mon, 11 Apr 2005 13:10 | in /code/linux | permalink | add comment (0 others)
A relocation is simply a record that stores
ELF defines two types of relocations
typedef struct {
Elf32_Addr r_offset; <--- address to fix
Elf32_Word r_info; <--- symbol table pointer and relocation type
} Elf32_Rel
typedef struct {
Elf32_Addr r_offset;
Elf32_Word r_info;
Elf32_Sword r_addend;
} Elf32_Rela
Thus RELA relocations have an extra field, the addend.
ianw@baci:~/tmp/fptest$ cat addendtest.c extern int i[4]; int *j = i + 2; ianw@baci:~/tmp/fptest$ cat addendtest2.c int i[4]; ianw@baci:~/tmp/fptest$ gcc -nostdlib -shared -fpic -s -o addendtest2.so addendtest2.c ianw@baci:~/tmp/fptest$ gcc -nostdlib -shared -fpic -o addendtest.so addendtest.c ./addendtest2.so ianw@baci:~/tmp/fptest$ readelf -r ./addendtest.so Relocation section '.rela.dyn' at offset 0x3b8 contains 1 entries: Offset Info Type Sym. Value Sym. Name + Addend 0000000104f8 000f00000027 R_IA64_DIR64LSB 0000000000000000 i + 8
So, sizeof(int) == 4 and j points two ints into i
(i.e. i + 8). R_IA64_DIR64LSB is just defined as SYMBOL + ADDEND which
means that to fixup this relocation, we need to write to Offset the
value of i (which we find and resolve) and add 8 to it.
If you try this on a 386, you will not have the explicit addend as it will use a REL relocation. You will actually have to read the memory at Offset to find the addend (it will be 8).
Having to read the memory before you do the relocation has all sorts of inefficiencies, and the only positive is that it saves space (because with RELA you have that extra field and "blank" spot waiting to be filled in the binary; with REL you keep that addend in the "blank spot"). Most modern architectures have dispensed with REL relocations all together (IA64, PPC64, AMD64).
posted at: Thu, 07 Apr 2005 09:15 | in /code/linux | permalink | add comment (0 others)
Since 2.6 it seems that the distinction between the "stable" series and "unstable" series has pretty much disappeared. As someone who needs to keep up with current developments, this is often a real pain as it is like trying to build a house on quicksand. But one thing I didn't think of was that this increases compatibility; as evidenced by the model doing exactly the opposite ... in this post Andrew Morton suggests that suse may have caused two interfaces to the same thing due to releasing before something was accepted into the kernel.org sources.
So there's one advantage to having the moving releases as we do now -- everyone has no excuses not to keep pushing their stuff into the official trees.
posted at: Thu, 24 Feb 2005 15:08 | in /code/linux | permalink | add comment (0 others)
Apart from the obvious 32-64 bit distinction between 386 and AMD64 there are two other interesting comparisons; parameter passing conventions and position independent code conventions.
Parameter Passing : on x86 parameters are passed via the stack. On AMD64 the first six "integer" arguments (anything that fits in a 64 bit register, basically) are passed via registers, similarly some floats can be passed via SSE registers. Only after this is data passed on the stack. On IA64, the first 8 arguments are passed in registers, whilst the rest are put on the stack.
On both AMD64 and IA64, there is a extra 16 byte "scratch area" (IA64) / 128 byte "red zone" (AMD64) that is below at the bottom of current stack frame. I would suggest that the smaller IA64 scratch area size is because of register windowing, which AMD64 does not support. On both architectures this is reserved and not modified by signal or interrupt handlers. "Leaf functions" (functions that do not call other functions) can use this area as their entire stack frame; saving some considerable overhead.
For varargs functions causes some confusion for
AMD64/IA64, since arguments might be floats or might be integers,
meaning they should be passed in either general or float/SSE registers
respectively. On AMD64, functions known to be varargs
functions should have a prologue that saves all arguments to a
"register save area" that has a known layout (you pass the maximum
number of possible floating point args as well to avoid saving
unnecessary registers). Then, as you use the va_arg
macro to go through the arguments you grab them from the register save
area. On IA64, you assume that the first 8 arguments are passed in
via the stack, and save these registers to your scratch area (2
registers) and 48 bytes of your stack (remaining 6 registers). This
means all your arguments are stacked together (the incoming parameter
list sits up against the scratch area) and va_arg can
simply "walk" upwards.
Undefined functions are a bit more tricky; IA64 suggests that if a float is passed into a function with an undefined parameter, it should be copied to both the first general purpose register and the first floating point register, just to be safe. AMD64 doesn't seem to make such assumptions for you, for example, on IA64
ianw@lime:/tmp$ cat function.c
void function(float f)
{
printf("%f\n", f);
}
ianw@lime:/tmp$ cat test.c
extern function();
int main(void)
{
float f = 10000.01;
function(f);
}
ianw@lime:/tmp$ gcc -o test test.c function.c
ianw@lime:/tmp$ ./test
10000.009766
That same code on AMD64 returns 0.
IP relative addressing: Position Independent Code (PIC) is code that can be loaded anywhere into memory and work. This is important because shared libraries may not always be at the same address, since other shared libraries might be loaded before or after them, etc. To maintain position independence, you can't rely on the base address of any code (because it might change) so you add a layer of indirection between your calls. In Linux/ELF land this is done with a Global Offset Table (GOT).
You can think of the GOT as a big list two columned list that has a symbol and it's "real address". Thus, instead of loading the symbol directly, you load the value from the GOT, and then load that value to find the real thing.
Note, you always know the relative address of the GOT,
because although the base address might change, the difference between
your code and where the GOT is will not. This means that if you need
to load an address from the GOT, the easiest way is to load via an
offset from the current instruction from the GOT entry. The compiler
knows the current instruction offset (note it can't know the current
instruction address, because the binary might be anywhere in
memory), so it wants to say load the address at
(CURRENT_INSTRUCTION - OFFSET_TO_GOT_ENTRY).
386 just can't do this -- there is no way to load an offset from
the current instruction pointer. The only way you can do it is to
keep a pointer to the GOT in a register (%ebp), and then
offset from that. This wastes a whole register, and when you only
have a few like the 386 this is a big killer.
AMD64 fixes this and allows you to offset from the current instruction pointer. This frees up a register, and changes the ABI by removing the distinction between the Absolute PLT and PIC PLT.
The PLT is a further enhancement that facilitates lazy binding. The PLT is "stubs" that point to a fix up function in the dynamic loader. At first, the GOT entries for functions point to the PLT entry for that function.
When you call the function, you don't go directly to it, you load it's value via the GOT and then jump to that value. As mentioned, at first this points to the PLT stub. This calls the lookup function in the dynamic loader which goes off and finds the real function (this might actually be in another shared library that needs to be loaded, for example). As arguments to this lookup function you pass the function name you're looking for (obviously) and the GOT entry of the original call. The dynamic loader finds the function, but then additionally fixes up the GOT entry to no longer point to the PLT stub, but to point directly to the required function. This means the next time you load from the GOT, you get the direct address of the function without the overhead of the PLT stub again.
IA64, allowing IP relative addressing, similarly doesn't have a distinction between absolute and PIC PLT's.
posted at: Tue, 22 Feb 2005 09:45 | in /code/linux | permalink | add comment (3 others)
Other questions that come up about threads
How many threads can I run? This depends on a number of things
ulimit of the user. Set the number of threads
with ulimit -u (or somewhere like
/etc/security/limits.conf).pthread_attr_setstacksize() allows you use smaller stacks.kernel/fork.c:fork_init()
/* * The default maximum number of threads is set to a safe * value: the thread structures can take up at most half * of memory. */ max_threads = mempages / (8 * THREAD_SIZE / PAGE_SIZE);
What does ps show me? By default, ps should
only show you the parent thread. Try with -m for the
child threads.
posted at: Thu, 10 Feb 2005 15:47 | in /code/linux | permalink | add comment (0 others)

This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.