PowerShell quick tip: Find in files recursively

This is a simple PowerShell snippet that makes it easier to search for a specific string in all files in a directory tree:

Function FindRec($filepattern, $pattern) {
	# Recursively search files matching $filepattern for $pattern,
	# then list each matching file together with its match count.
	Get-ChildItem -Recurse -Include $filepattern | Select-String -Pattern $pattern | Group Path | Select Name, Count | Sort Count
}

Example input/output:

C:\Development\Projects> FindRec *.php array_sum
Name                                             Count
----                                             -----
C:\Development\Projects\Alex\helpers.php         4
C:\Development\Projects\Alex\math.php            1

Harvest – a C# multithreaded web crawler

Looking at my blog statistics, it seems that web crawling and HTML parsing are very popular topics. Because of this, I’m open-sourcing Harvest, a multithreaded web crawler written in C#. With a lightweight and flexible architecture, it makes common crawling tasks easy.

While it’s a work in progress, Harvest is already used in a production environment to crawl a decent number of domains.

If you find Harvest interesting, please drop me a comment. Contributions are always welcome.

Harvest can be found on GitHub: https://github.com/alexandernyquist/Harvest/

Example

Here’s an example of what Harvest can do. The following snippet crawls my blog for all outbound links. Note the use of the ExcludeHostsExcept filter; without it, Harvest would follow all external links.

public class Program
{
    public static void Main()
    {
        var crawler = new Crawler
        {
            ExcludeFilters = new IExcludeFilter[]
            {
                new ExcludeImagesFilter(),
                new ExcludeTrackbacks(),
                new ExcludeMailTo(),
                new ExcludeHostsExcept(new[] { "nyqui.st" }),
                new ExcludeJavaScript(),
                new ExcludeAnchors(),
            }
        };
 
        crawler.OnCompleted += () =>
        {
            Console.WriteLine("[Main] Crawl completed!");
            Environment.Exit(0);
        };
 
        crawler.OnPageDownloaded += page =>
        {
            Console.WriteLine("[Main] Downloaded page {0}", page.Url);
 
            // Write external links
            foreach (var link in page.Links)
            {
                if (link.TargetUrl.Host != page.Url.Host)
                {
                    Console.WriteLine("Found outbound link from {0} to {1}", page.Url, link.TargetUrl);
                }
            }
        };
 
        crawler.Enqueue(new Uri("http://nyqui.st"));
        crawler.Start();
 
        Console.WriteLine("[Main] Crawler started.");
        Console.WriteLine("[Main] Press [enter] to abort.");
        Console.ReadLine();
    }
}

Wrapper around DOMDocument for more convenient work with HTML

This is mostly a reminder to myself. Don’t bother working around the limitations of PHP’s DOMDocument every time you’re dealing with HTML; use this class instead. It forces UTF-8 when loading a document, and it does not output any doctype or surrounding html markup when saving. Sample usage:

$dom = new HtmlDocument();
$content = '<p>Detta är ett test</p>';
$dom->loadHTML($content);
$html = $dom->saveHTML();
assert($html === '<p>Detta är ett test</p>'); // true
print $html;

The class itself:

class HtmlDocument extends DOMDocument {
	public function loadHTML($source, $options = 0) {
		/*
		* Force UTF-8.
		* Encapsulate $source inside a div so we can remove doctype
		* etc. when saving.
		*/
		$source = sprintf('<?xml encoding="UTF-8"><div>%s</div>', $source);
		parent::loadHTML($source, $options);
	}
 
	public function saveHTML(DOMNode $node = NULL) {
		if($node != null) {
			return parent::saveHTML($node);
		}
 
		/*
		* Exclude everything outside the first div. This effectively removes
		* any doctype and html declarations.
		*/
		$root = parent::getElementsByTagName('div')->item(0);
		$xml = parent::saveXML($root);
		return mb_substr($xml, 5, -6);
	}
}

Firefox 16 Developer Toolbar – Screenshot

Today, Mozilla launched Firefox 16. For us developers, this release contains a shiny new Developer Toolbar. One feature that I’m particularly excited about is the screenshot command. It lets you take a screenshot of the whole page or of a specific element.

First, fire up the new toolbar. This is done by either navigating to Tools -> Web Developer -> Developer Toolbar or pressing Shift+F2.

The toolbar will open up and position itself at the very bottom:

[Screenshot: the Firefox 16 Developer Toolbar]

Capture the whole page

To take a screenshot, simply execute the screenshot command. For example, if we want to capture the whole page, type:

screenshot full.png 0

This specifies that we want a screenshot taken after 0 (zero) seconds and saved to full.png. Naturally, we can also wait ten seconds before capturing:

screenshot full.png 10

When the screenshot has been taken, you’ll get a nice popup telling you the location of the saved file.

Note: Only .png files are supported. Every filename must therefore end in .png.

Capturing a specific element

We can also capture a specific element. This is done by providing a CSS selector as the fourth argument. Note that the command won’t let you capture elements that do not exist. Very helpful.

screenshot sidebar.png 0 false #primary

This will capture the contents of my sidebar into a file named sidebar.png.

Note that I added a third argument, false. This controls whether the screenshot should also include parts of the page outside the current scrolled bounds. It defaults to false.

Remember that you can always type “help <command>” to get more information about a particular command.


XPath – Utilities to get them

When parsing HTML, the recommended way to get at the desired elements is to use XPath queries or DOM traversal. Getting the XPath to a specific set of elements can often involve a bit of trial and error. However, both Firebug and Chrome Developer Tools offer utilities that help with this process.

Using Firebug, simply select the node you want using Inspect, then right-click the selected node and click “Copy XPath”.

The process is exactly the same in Chrome Developer Tools.

Both ways will copy an XPath expression to your clipboard.

Scraping Google Search Results

Let’s take another example. We want to extract the links to all sites from a regular Google search.

First, issue a regular search. Then open up Firebug or Chrome Developer Tools, right-click the a-element of any SERP item and click “Copy XPath”. I got*:

/html/body/div[4]/div[2]/div/div[4]/div[2]/div[2]/div[2]/div[2]/div/ol/li[5]/div/h3/a

Evaluate this XPath in your console using the $x() function (both Firebug and Chrome support it). That will yield exactly one element.
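For example, pasting the query from above into the console ($x() takes an XPath string and returns the matching nodes):

$x("/html/body/div[4]/div[2]/div/div[4]/div[2]/div[2]/div[2]/div[2]/div/ol/li[5]/div/h3/a")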

The next step is to generalize this query to match all elements. We know that each SERP item is a list item, and that our query contains li[5] (I happened to pick the fifth SERP item). We can simply remove the indexer ([5]) to match all list items. Our query is now:


/html/body/div[4]/div[2]/div/div[4]/div[2]/div[2]/div[2]/div[2]/div/ol/li/div/h3/a

This will give us all a-elements. However, we want the actual URLs. We can apply an attribute selector to retrieve them:

/html/body/div[4]/div[2]/div/div[4]/div[2]/div[2]/div[2]/div[2]/div/ol/li/div/h3/a/@href


As we can see, this query returns the href attribute for all SERP items. Easy, right?

What more?

I recommend using XPath Checker for Firefox to simplify this process even further. This extension makes it much easier to evaluate queries and see the matching results.

If you want to do this programmatically from C#, use HtmlAgilityPack. It’s a powerful library that makes this a breeze. See my previous article on parsing HTML with C# for an example of how to do it.
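As a quick, minimal sketch (not the code from that article): the URL and the simplified //h3/a query below are illustrative stand-ins, since the full datacenter-specific path above is brittle.

using System;
using HtmlAgilityPack;

class SerpLinks
{
    static void Main()
    {
        // Load and parse a results page. The URL and the simplified
        // query are illustrative; use whatever "Copy XPath" gave you.
        var web = new HtmlWeb();
        var doc = web.Load("http://www.google.com/search?q=xpath");

        // SelectNodes returns null when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//h3/a");
        if (links == null)
        {
            return;
        }

        foreach (var link in links)
        {
            Console.WriteLine(link.GetAttributeValue("href", string.Empty));
        }
    }
}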

If you’re using PHP, a combination of DOMDocument and DOMXPath will do the job.

*Note that Google’s markup can change depending on which datacenter you end up on.


Setting up MVC3 with StructureMap

ASP.NET MVC 3 introduced the concept of a dependency resolver, which enables us to inject our own types into controllers, action filters, view engines and so forth. Here is an implementation of IDependencyResolver for StructureMap:

public class StructuremapDependencyResolver : IDependencyResolver {
    private readonly IContainer _container;

    public StructuremapDependencyResolver(IContainer container) {
        _container = container;
    }

    public object GetService(Type serviceType) {
        // TryGetInstance returns null for unregistered types, which is
        // what MVC expects when it asks for optional services.
        if (serviceType.IsAbstract || serviceType.IsInterface) {
            return _container.TryGetInstance(serviceType);
        }

        return _container.GetInstance(serviceType);
    }

    public IEnumerable<object> GetServices(Type serviceType) {
        return _container.GetAllInstances(serviceType).Cast<object>();
    }
}

With our new DependencyResolver, a custom controller factory is no longer needed.

You tell MVC3 to use our resolver by registering it in Application_Start (in Global.asax, of course):

protected void Application_Start() {
    ...

    var container = new Container(new DependencyRegistry());
    DependencyResolver.SetResolver(new StructuremapDependencyResolver(container));

    ...
}

The actual container is set up as normal, with registries passed into it via the constructor.
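For illustration, here is a minimal sketch of what such a registry could look like. DependencyRegistry is the name used above; the interface and implementation it wires up are hypothetical.

using StructureMap.Configuration.DSL;

public interface IUserRepository { }
public class SqlUserRepository : IUserRepository { }

public class DependencyRegistry : Registry {
    public DependencyRegistry() {
        // Hypothetical mapping, for illustration only.
        For<IUserRepository>().Use<SqlUserRepository>();
    }
}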

Happy coding!


Operation could destabilize the runtime

“Operation could destabilize the runtime” is an error I’ve been getting lately when working with both RavenDB and Ncqrs. The problem is caused by Json.NET and IntelliTrace (Visual Studio 2010 Ultimate) not playing nicely together. A solution is to disable IntelliTrace for all Json.NET assemblies. This is done via:

Debug -> Options and Settings -> IntelliTrace -> Modules -> Add *Newtonsoft*

While this solved my issues with RavenDB and Ncqrs, I’m sure it applies to any library or application using Json.NET.

Hope this helps.


Massive data access for MySQL

Massive is a tiny, single-file data access wrapper around SQL Server. It makes heavy use of System.Dynamic, which means that it’s very flexible. It’s a wrapper that does not get in your way: no extra assemblies, just a single file that you can drop into your solution.

Since Massive is SQL Server only and I work with MySQL, I’ve forked it to add MySQL support.

You can find the repository over at GitHub. There is also a NuGet package available, named Massive.MySQL, which has MySQL.Data as a dependency. This means that you only need to do the following to get a working data access wrapper:

Install-Package Massive.MySQL

This will download Massive.MySQL, add it to your project, and download MySQL.Data (which is the only dependency).

I’m quoting the original readme for usage:

Let’s say we have a table named “Products”. You create a class like this:

public class Products:DynamicModel {
    public Products() : base("northwind") {
        PrimaryKeyField = "ProductID";
    }
}

Now you can query thus:

var table = new Products();
//grab all the products
var products = table.All();
//just grab from category 4. This uses named parameters
var productsFour = table.All(columns: "ProductName as Name", where: "WHERE categoryID=@0",args: 4);

You can also run ad-hoc queries as needed:

var result = table.Query("SELECT * FROM Categories");

Of course, Massive also supports updates and inserts:

var table = new Products();
var poopy = new {ProductName = "Chicken Fingers"};
//update Product with ProductID = 12 to have a ProductName of "Chicken Fingers"
table.Update(poopy, 12);
//pretend we have a class like Products but it's called Categories
var categories = new Categories();
//do it up - the new ID will be returned from the query
var newID = categories.Insert(new {CategoryName = "Buck Fify Stuff", Description = "Things I like"});

For more information, please see the README file.

Thank you, Rob, for the effort you’ve put into this; it’s a great tool.


Json.NET / Newtonsoft.Json lowercase keys

This post is just a quick tip on how to serialize keys in lowercase using Json.NET. The secret is to use a custom ContractResolver. Definition:

public class LowercaseContractResolver : DefaultContractResolver {
    protected override string ResolvePropertyName(string propertyName) {
        return propertyName.ToLower();
    }
}

Usage:

var settings = new JsonSerializerSettings();
settings.ContractResolver = new LowercaseContractResolver();
var json = JsonConvert.SerializeObject(authority, Formatting.Indented, settings);

This will ensure that all keys are in lowercase, even when the properties are PascalCased or camelCased.
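For example, given a hypothetical Authority class (a stand-in for whatever object you pass to SerializeObject):

public class Authority {
    public string DisplayName { get; set; }
    public int MemberCount { get; set; }
}

// Serialized with the LowercaseContractResolver above, an instance
// comes out with all-lowercase keys:
// {
//   "displayname": "Example",
//   "membercount": 42
// }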


Google Analytics API in C#

Edit: I’m not at all satisfied with the API below. Hopefully I will find time to give it some love, perhaps using dynamics.

I’m not that fond of the official GData library, and especially not its Analytics wrapper. Since I only needed a small subset of the available features, I’ve developed my own library.

Let me introduce AnalyticsApi.

AnalyticsApi is still at a very early stage, but it’s used in production in one of our projects at work. My goal is to create an easy-to-use, strongly typed library. It should be both unit and integration testable.

It has a fluent, easy-to-use interface:

var dashboard = new AnalyticsService(new AnalyticsDataProvider(HttpWrapper.Standard))
                .Username("username@gmail.com")
                .Password("password")
                .Logon()
                .Profile(1312231)
                .GetDashboard("2011-01-01", "2011-01-30");
 
Console.WriteLine("PageViews: {0}", dashboard.PageViews);
Console.WriteLine("Bouncrate: {0}", dashboard.Bouncrate);
Console.WriteLine("Visits: {0}", dashboard.Visits);

The AnalyticsService has a dependency on an AnalyticsDataProvider, which allows you to mock the results and test in isolation.

Each request is defined as a request map, similar to FluentNHibernate. A request for dashboard data is defined as follows:

class DashboardApiMap : ApiMap<DashboardRequest>
{
    public DashboardApiMap()
    {
        Map(x => x.Visits, "ga:visits");
        Map(x => x.PageViews, "ga:pageviews");
        Map(x => x.Bounces, "ga:bounces");
        Map(x => x.Entrances, "ga:entrances");
        Map(x => x.TimeOnSite, "ga:timeonsite");
        Map(x => x.NewVisits, "ga:newvisits");

        Map(x => x.StartDate, "startDate", ElementLevel.FeedLevel);
        Map(x => x.EndDate, "endDate", ElementLevel.FeedLevel);
    }
}

DashboardRequest is just a simple POCO with some methods for calculations, for example computing the bounce rate.
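A minimal sketch of what such a POCO might look like. The property names follow the map above; the bounce rate formula (bounces divided by entrances) is my assumption, not necessarily what the library uses.

public class DashboardRequest
{
    public int Visits { get; set; }
    public int PageViews { get; set; }
    public int Bounces { get; set; }
    public int Entrances { get; set; }
    public double TimeOnSite { get; set; }
    public int NewVisits { get; set; }
    public string StartDate { get; set; }
    public string EndDate { get; set; }

    // Assumed formula: the share of entrances that bounced.
    public double BounceRate()
    {
        return Entrances == 0 ? 0 : (double)Bounces / Entrances;
    }
}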

One thing I’m aiming to add pretty soon is rate limit checks, and I want to make it even easier to define new requests; I’m thinking about using extension methods for this. The AnalyticsService should also implicitly use the “live” AnalyticsDataProvider if nothing else is specified.

This library is far from finished, but I’m going to give it some love pretty soon.

Check out AnalyticsApi at GitHub.